2024-08-23T03:53:17.521Z | <Vallari Agrawal> It could be a DNS issue? On smithi049, it gives me 502 Bad Gateway:
```[vallariag@smithi049 ~]$ podman pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49
Trying to pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49...
Error: parsing image configuration: fetching blob: received unexpected HTTP status: 502 Bad Gateway```
But it works okay on my local machine with sepia VPN connected:
```✗ docker pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49
5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49: Pulling from ceph-ci/ceph
e8b54c863393: Pulling fs layer
db43c217a2c8: Pulling fs layer ``` |
2024-08-23T06:38:15.083Z | <Sunil Angadi> Hi team,
the PR build is failing for `centos` distros:
<https://shaman.ceph.com/builds/ceph/wip-sangadi1-testing-2024-08-22-1458-quincy/>
can somebody please help me root-cause the issue?
[https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAIL[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/82350//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/82350//consoleFull)
what do I need to do to get the build to succeed? |
2024-08-23T07:19:11.472Z | <Sunil Angadi> @Sayalee Raut this might be the reason |
2024-08-23T07:25:27.863Z | <Sayalee Raut> Hey @Sunil Angadi, yes, my shaman build failed yesterday late evening, but now that quay is back up, I will try rebuilding it. Thanks! |
2024-08-23T07:26:56.497Z | <Dan Mick> ah. there is a problem with [quay.ceph.io](http://quay.ceph.io)'s backing store. The web UI was probably working but actual bulk file operations probably were not. Let me work on it. |
2024-08-23T08:22:29.528Z | <Sayalee Raut> Okay, thanks @Dan Mick |
2024-08-23T12:48:43.023Z | <Yaarit> @Dan Mick thanks, I see it is working |
2024-08-23T14:41:57.101Z | <yuriw> I see several c9 builds failed for quincy as well, including @Sayalee Raut's build
@Dan Mick @Laura Flores ^ |
2024-08-23T14:42:48.180Z | <yuriw> [https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVA[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/82347//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/82347//consoleFull)
I see:
curl: (22) The requested URL returned error: 404
error: skipping <https://3.chacra.ceph.com/r/ceph/wip-saraut02-testing-2024-08-22-1312/cd4c799c9219dc7debfb16701b85f090fcc7074d/centos/9/flavors/default//noarch/ceph-release-1-0.el9.noarch.rpm> - transfer failed |
2024-08-23T14:51:19.791Z | <Casey Bodley> can we consider removing these nodes from pr checks in the meantime? |
2024-08-23T16:52:50.480Z | <Laura Flores> So, for the centos distros, there is an extra step in the job after building packages where a container image is created and pushed to [quay.ceph.io](http://quay.ceph.io). Dan Mick mentioned that [quay.ceph.io](http://quay.ceph.io) is somewhat broken after upgrading the long running cluster (a cluster we have which hosts many of our services, including [quay.ceph.io](http://quay.ceph.io)) two days ago, so builds will break until that is fixed. |
2024-08-23T16:53:11.713Z | <Laura Flores> So @Sunil Angadi you aren't doing anything wrong with how you built the branch; it's an external issue. |
2024-08-23T16:57:10.701Z | <Dan Mick> OK, so, rgw has issues with the rgw.quay service |
2024-08-23T16:58:10.343Z | <Dan Mick> They seem very basic, but I need help with where to start. @Casey Bodley can you spare some time this morning? |
2024-08-23T16:58:30.583Z | <Casey Bodley> i'm all yours |
2024-08-23T16:58:52.074Z | <Dan Mick> so the service was set up with cephadm AFAIK: |
2024-08-23T16:59:07.123Z | <Dan Mick> # ceph orch ls --service-name rgw.quay --export
service_type: rgw
service_id: quay
service_name: rgw.quay
placement:
  count: 2
spec:
  rgw_frontend_type: beast |
2024-08-23T17:00:21.156Z | <Dan Mick> the radosgw daemons got restarted, and fail immediately. q1) how is one supposed to find the logs for such a failing service? How does one even know which hosts were chosen for it to run last, canonically? |
2024-08-23T17:00:43.253Z | <Dan Mick> cephadm health detail seems to help, showing reesi002 and reesi006 |
2024-08-23T17:01:28.413Z | <Dan Mick> on reesi002, the service status shows |
2024-08-23T17:02:22.945Z | <Dan Mick> # systemctl -l --no-pager status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service
× ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service - Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a
Loaded: loaded (/etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service; enabled; vendor preset: enabled)
Active: failed (Result: protocol) since Fri 2024-08-23 08:06:23 UTC; 8h ago
CPU: 1.483s
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Scheduled restart job, restart counter is at 5.
Aug 23 08:06:23 reesi002 systemd[1]: Stopped Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Consumed 1.483s CPU time.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Start request repeated too quickly.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Failed with result 'protocol'.
Aug 23 08:06:23 reesi002 systemd[1]: Failed to start Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a.
Aug 23 14:52:52 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Aug 23 14:52:54 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Aug 23 14:52:55 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed. |
2024-08-23T17:02:43.376Z | <Dan Mick> (nothing useful, IOW) |
2024-08-23T17:04:50.044Z | <Dan Mick> journalctl -u is slightly more verbose: |
2024-08-23T17:05:07.830Z | <Dan Mick> deferred set uid:gid to 167:167 (ceph:ceph)
]: ceph version 19.1.1 (1d9f35852eef16b81614e38a05cf88b505cc142b) squid (rc), p>
]: framework: beast
]: framework conf key: port, val: 80
]: rgw main: failed to load zone: (2) No such file or directory |
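The journal lines above are truncated at the terminal width; a hedged way to pull the full, untruncated lines for that unit (unit name taken from the systemctl status output above):
```
journalctl -u ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service \
    --no-pager -o cat | tail -n 50
```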
2024-08-23T17:05:34.444Z | <Casey Bodley> hmm, ok |
2024-08-23T17:05:34.767Z | <Dan Mick> the "failed to load zone" error seems useful. I don't see anything like a zone name in the error message, and I don't know where to go from here. |
2024-08-23T17:06:16.936Z | <Dan Mick> AFAIK zones weren't ever configured |
2024-08-23T17:06:33.781Z | <Dan Mick> so it's probably whatever defaults to a non-multi config, but... |
2024-08-23T17:07:19.547Z | <Casey Bodley> my understanding is that cephadm creates a realm/zonegroup/zone for each service. can you confirm that @Adam King? |
2024-08-23T17:09:42.726Z | <Dan Mick> is that configuration something that can be examined somehow? radosgw-admin seemed to fail with a similar "can't help you" error message for the few dumb things I tried |
2024-08-23T17:10:27.985Z | <Casey Bodley> i ssh'ed in but radosgw-admin isn't installed there |
2024-08-23T17:10:33.183Z | <Casey Bodley> i tried 'radosgw-admin zone list' |
2024-08-23T17:10:40.168Z | <Dan Mick> well it's in the container I assume |
2024-08-23T17:11:02.276Z | <Adam King> You can have a zone created for you if you run the rgw spec through the `ceph rgw realm bootstrap -i <spec>` command, but specs just passed to the orchestrator don't create realms/zones/zonegroups. |
2024-08-23T17:11:49.342Z | <Casey Bodley> ok, thanks. so it's probably just expecting a zone/zonegroup named 'default' |
2024-08-23T17:12:57.021Z | <Dan Mick> Note that with cephadm, radosgw daemons are configured via the monitor configuration database instead of via a ceph.conf or the command line. If that configuration isn’t already in place (usually in the `client.rgw.<something>` section), then the radosgw daemons will start up with default settings (e.g., binding to port 80). |
2024-08-23T17:13:18.447Z | <Dan Mick> says the doc. so there ought to be remnants in one of the mon config mechanisms? |
2024-08-23T17:13:59.244Z | <Dan Mick> ceph config dump shows some, but nothing about zones/realms |
2024-08-23T17:14:04.929Z | <Casey Bodley> `rgw_zone` is the relevant config option |
2024-08-23T17:14:39.697Z | <Dan Mick> is there a way to examine the pools to discover the zone name |
2024-08-23T17:15:14.323Z | <Casey Bodley> the rgw pools should have a prefix like {zone-name}.rgw.* |
2024-08-23T17:15:38.756Z | <Dan Mick> 94 default.rgw.control
95 default.rgw.meta
96 default.rgw.log
97 default.rgw.buckets.index
98 default.rgw.buckets.data
99 default.rgw.buckets.non-ec |
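A rough sketch of the two checks being discussed here, namely the `{zone-name}.rgw.*` pool prefix and whether `rgw_zone`/`rgw_realm` are set anywhere in the mon config (the exact grep patterns are illustrative):
```
ceph osd pool ls | grep rgw                      # pool names carry the zone-name prefix
ceph config dump | grep -E 'rgw_(zone|realm)'    # any explicit zone/realm in the mon config?
```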
2024-08-23T17:16:28.302Z | <Dan Mick> so is it possible that radosgw changed to not assume 'default', and now requires explicitly naming the zone in config? |
2024-08-23T17:17:26.115Z | <Casey Bodley> if rgw_zone is empty and there is no rgw_realm configured, it should read the zone named "default" and create it if it doesn't exist |
2024-08-23T17:19:35.919Z | <Casey Bodley> so "failed to load zone: (2) No such file or directory" is confusing |
2024-08-23T17:19:39.027Z | <Dan Mick> the end of `ceph config-key ls | grep rgw`: |
2024-08-23T17:19:49.945Z | <Dan Mick> "config/client.rgw.cephadmin3.rgw0/debug_rgw",
"config/client.rgw.quay.reesi002.anxbdb/container_image",
"config/client.rgw.quay.reesi002.anxbdb/debug_ms",
"config/client.rgw.quay.reesi002.anxbdb/debug_rgw",
"config/client.rgw.quay.reesi002.anxbdb/rgw_frontends",
"config/client.rgw.quay.reesi006.zmjsox/container_image",
"config/client.rgw.quay.reesi006.zmjsox/rgw_frontends",
"config/client.rgw.rgw.quay.reesi002.anxbdb/debug_ms",
"config/client.rgw.rgw.quay.reesi002.anxbdb/debug_rgw",
"config/client.rgw.rgw.quay.reesi002.anxbdb/log_to_file",
"config/client.rgw.rgw.quay.reesi006.zmjsox/debug_rgw",
"config/client.rgw.rgw.quay.reesi006.zmjsox/log_to_file",
"config/client.rgw/debug_rgw",
"config/client.rgw/log_to_file",
"mgr/cephadm/rgw_migration_queue",
"mgr/cephadm/spec.rgw.quay",
"rgw/cert/rgw.quay" |
2024-08-23T17:21:29.929Z | <Dan Mick> I tried an strace, but it seems like everything it's doing is rados ops so that wasn't much help. I tried to add debug_rgw=30 to the cmdline but that didn't seem to help either |
2024-08-23T17:22:25.541Z | <Casey Bodley> debug-ms=1 would log the individual rados ops |
2024-08-23T17:24:48.587Z | <Dan Mick> if you think that's worth a try I can try it |
2024-08-23T17:26:08.661Z | <Casey Bodley> yes please |
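One plausible way to flip that on through the mon config for just these daemons and bounce them; the config section and daemon names are the ones visible in the `config-key ls` output in this thread, and which section the daemons actually read is hedged:
```
ceph config set client.rgw.quay.reesi002.anxbdb debug_ms 1
ceph orch daemon restart rgw.quay.reesi002.anxbdb
# reproduce the failure, read the journal, then turn it back off
ceph config rm client.rgw.quay.reesi002.anxbdb debug_ms
```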
2024-08-23T17:27:15.448Z | <Casey Bodley> or can you capture the output of `radosgw-admin realm default`? |
2024-08-23T17:27:34.933Z | <Casey Bodley> wait no |
2024-08-23T17:28:01.504Z | <Casey Bodley> `radosgw-admin realm list` would show both |
2024-08-23T17:28:34.080Z | <Dan Mick> where would I expect log messages from debug_ms to show up? |
2024-08-23T17:28:56.352Z | <Casey Bodley> journalctl i think? |
2024-08-23T17:29:51.420Z | <Dan Mick> ok yeah |
2024-08-23T17:30:51.200Z | <Dan Mick> Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.109 v2:172.21.2.221:6832/4215531678 1 ==== osd_op_reply(3 default.zone.87194b64-d43d-470f-a491-a67115d255a7 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) ==== 193+0+0 (crc 0 0 0) 0x555da573c780 con 0x555da5774400 |
2024-08-23T17:30:59.703Z | <Dan Mick> is right before the "no such file" generic message |
2024-08-23T17:31:24.583Z | <Casey Bodley> do you see anything about default.realm above that? |
2024-08-23T17:32:00.889Z | <Dan Mick> Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.78 v2:172.21.2.224:6850/2039386015 1 ==== osd_op_reply(1 default.realm [read 0~46 out=46b] v0'0 uv2 ondisk = 0) ==== 157+0+46 (crc 0 0 0) 0x555da573c
280 con 0x555da49aac00 |
2024-08-23T17:32:09.038Z | <Casey Bodley> oh, that default.zone.{realm-id} implies that it did find a default realm |
2024-08-23T17:32:09.413Z | <Dan Mick> ISTR there's some way to match replies to requests |
2024-08-23T17:32:37.944Z | <Dan Mick> is it 172.21.2.202:0/2063761546 maybe |
2024-08-23T17:32:48.697Z | <Dan Mick> nah, that's probably just the client addr/nonce |
2024-08-23T17:35:39.739Z | <Casey Bodley> ok, so something at some point created a realm and set it as the default, which is 'sticky' |
2024-08-23T17:35:49.847Z | <Dan Mick> this might be the whole trace |
2024-08-23T17:36:01.943Z | <Dan Mick> `Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] -- osd_op(unknown.0.0:1 93.12 93:49953fa1:::default.realm:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5aa9800 con 0x555da49aac00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] conn(0x555da49aac00 0x555da4a1cb00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] conn(0x555da49aac00 0x555da4a1cb00 crc :-1 s=READY pgs=1175 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.78 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.78 v2:172.21.2.224:6850/2039386015 1 ==== osd_op_reply(1 default.realm [read 0~46 out=46b] v0'0 uv2 ondisk = 0) ==== 157+0+46 (crc 0 0 0) 0x555da573c280 con 0x555da49aac00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] -- osd_op(unknown.0.0:2 93.6 93:610446fc:::realms.87194b64-d43d-470f-a491-a67115d255a7:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5774000 con 0x555da5aa9c00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 crc :-1 s=READY pgs=1044 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.79 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.79 v2:172.21.2.224:6874/3034296905 1 ==== osd_op_reply(2 realms.87194b64-d43d-470f-a491-a67115d255a7 [call out=48b,read 0~107 out=107b] v0'0 uv3 ondisk = 0) ==== 229+0+155 (crc 0 0 0) 0x555da573c500 con 0x555da5aa9c00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] -- osd_op(unknown.0.0:3 93.14 93:2d6507ee:::default.zone.87194b64-d43d-470f-a491-a67115d255a7:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5774800 con 0x555da5774400`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 crc :-1 s=READY pgs=1213 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.109 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.109 v2:172.21.2.221:6832/4215531678 1 ==== osd_op_reply(3 default.zone.87194b64-d43d-470f-a491-a67115d255a7 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) ==== 193+0+0 (crc 0 0 0) 0x555da573c780 con 0x555da5774400`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: rgw main: failed to load zone: (2) No such file or directory`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: Couldn't init storage provider (RADOS)` |
2024-08-23T17:36:22.204Z | <Dan Mick> I'll try that radosgw-admin command |
2024-08-23T17:37:23.383Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin realm list
{
"default_info": "87194b64-d43d-470f-a491-a67115d255a7",
"realms": [
"default"
]
} |
2024-08-23T17:39:22.101Z | <Casey Bodley> i think the simplest workaround is to set `rgw_zone = default` in the mon config. i'm not positive about the exact syntax, but it would look something like `ceph config set client.rgw.rgw.quay rgw_zone default` |
2024-08-23T17:39:38.745Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin zone list
{
"default_info": "",
"zones": [
"default"
]
} |
2024-08-23T17:40:02.440Z | <Casey Bodley> `client.rgw.quay` also shows up in that config-key output |
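Since it isn't obvious which section the daemons actually read, a hedged version of that workaround would try both candidates seen above and then confirm what took effect:
```
ceph config set client.rgw.quay rgw_zone default
ceph config set client.rgw.rgw.quay rgw_zone default    # the other candidate section
ceph config dump | grep rgw_zone                        # confirm which one took effect
```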
2024-08-23T17:40:36.658Z | <Dan Mick> are there default zone/realm names, and are they "default"? You're implying 'no' |
2024-08-23T17:41:12.545Z | <Dan Mick> and, if we change it, maybe we should change it with a cephadm service ... manifest |
2024-08-23T17:41:16.467Z | <Dan Mick> or whatever it's called |
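If the change were made declaratively, a hypothetical spec based on the exported one above might look like the following; `rgw_zone: default` is added explicitly and `rgw_realm` is deliberately left out, and the filename and field placement under `spec:` are assumptions:
```
cat > rgw-quay.yaml <<'EOF'
service_type: rgw
service_id: quay
placement:
  count: 2
spec:
  rgw_frontend_type: beast
  rgw_zone: default
EOF
ceph orch apply -i rgw-quay.yaml
```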
2024-08-23T17:42:54.077Z | <Casey Bodley> "default" is overloaded here, but something created a realm with the `--default` option which means that everything now tries to run inside that realm |
2024-08-23T17:43:22.839Z | <Casey Bodley> for zones and zonegroups, there's fallback behavior to use the name "default" if nothing is specifically configured |
2024-08-23T17:43:36.475Z | <Dan Mick> ....but the normal state of realm is "not specified"? |
2024-08-23T17:43:48.409Z | <Dan Mick> (for a non-realm config) |
2024-08-23T17:44:51.357Z | <Casey Bodley> the normal state would be no realms exist |
2024-08-23T17:45:03.594Z | <Dan Mick> (I'm also trying to figure out if this is a behavior change that will catch someone else) |
2024-08-23T17:46:13.051Z | <Dan Mick> I'd be very surprised if any human specifically specified a realm for this. I can't be sure (multiple admins) but I'm 99.5% sure the service was configured only with the cephadm service description shown above. So I'm concerned that whatever cephadm did left a state that radosgw is now puking on in the transition from v18-v19 |
2024-08-23T17:46:17.888Z | <Casey Bodley> earlier in the release process we documented this in the release notes, but it's not the same issue:
> * rgw: On startup, radosgw and radosgw-admin now validate the ``rgw_realm``
> config option. Previously, they would ignore invalid or missing realms and
> go on to load a zone/zonegroup in a different realm. If startup fails with
> a "failed to load realm" error, fix or remove the ``rgw_realm`` option. |
2024-08-23T17:47:17.855Z | <Dan Mick> but, where would something have "created a realm with the --default option"? |
2024-08-23T17:47:32.364Z | <Dan Mick> would that not be in the mon config? |
2024-08-23T17:48:00.353Z | <Dan Mick> i.e. what is rgw trying to validate? I don't see a specification |
2024-08-23T17:48:45.432Z | <Casey Bodley> rgw wouldn't have done that itself. possible that someone tried `ceph rgw realm bootstrap -i <spec>`? i don't know if that adds `--default`, but it shouldn't |
2024-08-23T17:49:16.429Z | <Dan Mick> I really doubt it. All the users of that rgw instance are likely unaware that realms even exist |
2024-08-23T17:49:27.028Z | <Casey Bodley> my recollection is that people hit that "failed to load realm" error when they accidentally included an `rgw_realm` field in their rgw service spec |
2024-08-23T17:49:44.202Z | <Casey Bodley> before squid that didn't cause an error |
2024-08-23T17:49:57.279Z | <Dan Mick> but even if they did something like ceph rgw realm, where would that leave info if not the mon config? |
2024-08-23T17:50:32.126Z | <Casey Bodley> it would create rados objects that would be visible to the `radosgw-admin realm list` command |
2024-08-23T17:52:03.310Z | <Dan Mick> so we have a situation where radosgw now does something with rgw_realm in the config; since that doesn't exist, what does it do? |
2024-08-23T17:52:19.559Z | <Dan Mick> sorry to belabor this, I just want to understand the theory to understand the wider impact |
2024-08-23T17:55:12.302Z | <Casey Bodley> if the `rgw_realm` option is not configured, it looks for the rados object `default.realm` and, if found, loads that realm |
2024-08-23T17:55:43.061Z | <Dan Mick> can I look for that object? its ctime/mtime might be useful. what pool? |
2024-08-23T17:56:11.885Z | <Casey Bodley> should be in `rgw.root` |
2024-08-23T17:56:20.187Z | <Dan Mick> .rgw.root? |
2024-08-23T17:56:41.873Z | <Dan Mick> (leading .) |
2024-08-23T17:57:14.545Z | <Casey Bodley> right |
2024-08-23T17:57:54.530Z | <Dan Mick> ($87194b64-d43d-470f-a491-a67115d255a7 |
2024-08-23T17:58:26.748Z | <Dan Mick> (which, save for the leading chars, is what realm list shows) |
2024-08-23T17:59:06.126Z | <Casey Bodley> that's the realm id, yeah |
2024-08-23T17:59:19.462Z | <Dan Mick> modified 2/22/2024 |
2024-08-23T18:00:20.334Z | <Dan Mick> so...what's rgw failing to find? |
2024-08-23T18:01:21.620Z | <Casey Bodley> because it finds a realm, it tries to load the realm's default zone from the rados object `default.zone.87194b64-d43d-470f-a491-a67115d255a7` but finds none |
2024-08-23T18:01:37.918Z | <Casey Bodley> if it hadn't found the realm, it would fall back to zone/zonegroups named "default" |
2024-08-23T18:02:04.852Z | <Casey Bodley> so something created a realm, but didn't put anything in it |
2024-08-23T18:02:15.583Z | <Dan Mick> fwiw: |
2024-08-23T18:02:16.500Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin realm get-default
default realm: 87194b64-d43d-470f-a491-a67115d255a7 |
2024-08-23T18:02:33.343Z | <Dan Mick> and, ok |
2024-08-23T18:04:16.303Z | <Dan Mick> indeed, there is no default.zone.8... |
2024-08-23T18:04:23.832Z | <Dan Mick> there is a default.zonegroup.8.... |
2024-08-23T18:04:39.193Z | <Casey Bodley> yeah that was in the debug-ms=1 output |
2024-08-23T18:04:58.329Z | <Dan Mick> ugh, can I really not rados get to stdout |
2024-08-23T18:05:40.274Z | <Dan Mick> (well I can specify /dev/stdout as the file) |
2024-08-23T18:06:23.900Z | <Dan Mick> # rados -p .rgw.root ls | sort
default.realm
default.zonegroup.87194b64-d43d-470f-a491-a67115d255a7
period_config.87194b64-d43d-470f-a491-a67115d255a7
periods.87194b64-d43d-470f-a491-a67115d255a7:staging
periods.87194b64-d43d-470f-a491-a67115d255a7:staging.latest_epoch
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.1
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.2
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.3
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.4
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.5
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.6
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.latest_epoch
periods.c752bb6f-a9bf-4160-af9f-da1646fa42de.1
periods.c752bb6f-a9bf-4160-af9f-da1646fa42de.latest_epoch
realms.87194b64-d43d-470f-a491-a67115d255a7
realms.87194b64-d43d-470f-a491-a67115d255a7.control
realms_names.default
zonegroup_info.d3419277-bb38-437f-a610-1aa77b0989df
zonegroups_names.default
zone_info.0332d5b4-b7b6-4a18-ad00-aa1a16c66256
zone_names.default |
2024-08-23T18:07:26.041Z | <Dan Mick> zone_names.default contains 033.. |
2024-08-23T18:08:34.055Z | <Casey Bodley> wow. from the periods.* objects i take that to mean that someone created two other realms with ids `c752bb6f-a9bf-4160-af9f-da1646fa42de` and `87254aaf-7abe-4c13-bc70-27b1976ac684` and deleted them |
2024-08-23T18:08:34.455Z | <Dan Mick> all of those objects are from 02/22 |
2024-08-23T18:09:22.821Z | <Dan Mick> except for realms.87194b64-d43d-470f-a491-a67115d255a7.control, from 8/21 |
2024-08-23T18:09:34.191Z | <Dan Mick> and zone_names.default, from 2018 (!) |
2024-08-23T18:10:50.787Z | <Dan Mick> (you have sudo on reesi001 which is where I'm doing this, btw) |
2024-08-23T18:15:05.649Z | <Dan Mick> so I'm coming back to "this was a working config before the upgrade, even if it had hinky stuff in the rados objects/mon config". Is there a transition that others may need to adjust to? (and I'll point out that I didn't even glance at the release notes, which it now occurs to me was probably a bad idea) |
2024-08-23T18:18:26.076Z | <Casey Bodley> yeah i think we'll need to come up with another release note for this |
2024-08-23T18:18:27.509Z | <Dan Mick> (and when I say "from" I'm talking about mtime from rados stat2) |
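For reference, the kind of commands behind those timestamps, using object names from the listing above (`stat2` is the variant Dan names, and `/dev/stdout` is the trick he mentions earlier):
```
rados -p .rgw.root stat2 default.realm
rados -p .rgw.root stat2 zone_names.default
rados -p .rgw.root get default.realm /dev/stdout
```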
2024-08-23T18:18:37.993Z | <Casey Bodley> i'll discuss with the team |
2024-08-23T18:19:03.048Z | <Casey Bodley> for now i'm considering a `sudo rados -p .rgw.root rm default.realm` to get the LRC going again |
2024-08-23T18:19:28.536Z | <Casey Bodley> any objection? i'm not sure the `ceph config set` for rgw_zone i suggested before will work |
2024-08-23T18:20:43.629Z | <Dan Mick> so if default.realm doesn't exist, it'll use realm name 'default' and open zone_names.default?.. |
2024-08-23T18:21:54.292Z | <Casey Bodley> if rgw_realm is empty and `default.realm` doesn't exist and rgw_zone is empty, it will load `zone_names.default` |
2024-08-23T18:23:14.308Z | <Casey Bodley> <https://github.com/ceph/ceph/blob/squid/src/rgw/driver/rados/rgw_zone.cc#L1192-L1223> is the algorithm i'm looking at |
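A non-authoritative summary of that lookup order, pieced together from Casey's descriptions in this thread rather than from the code itself:
```
# 1. rgw_realm set in config          -> load that realm by name; startup now fails if it's missing
# 2. rgw_realm empty                  -> read .rgw.root/default.realm; if it exists, load that realm
#                                        and then default.zone.<realm-id>   <- the step failing here
# 3. no default.realm, rgw_zone empty -> fall back to zone_names.default / zonegroups_names.default
rados -p .rgw.root ls | sort          # shows which of these objects exist on a given cluster
```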
2024-08-23T18:23:18.607Z | <Dan Mick> so is this a theory: config probably got a specific realm **named** default created at some point, id 8719... |
2024-08-23T18:23:39.760Z | <Dan Mick> evidenced by the 2/22 default.realm and the periods.8719.. files |
2024-08-23T18:24:21.215Z | <Dan Mick> that was innocuous until 19.1.1 when rgw started caring about default.realm and trying to use it as the base for finding the zone, and it can't |
2024-08-23T18:24:47.913Z | <Dan Mick> well. started caring that the contents of default.realm were valid |
2024-08-23T18:25:24.801Z | <Dan Mick> I'll collect copies of all the objects in .rgw.root before changing anything |
2024-08-23T18:26:15.404Z | <Dan Mick> in /root/rgw |
2024-08-23T18:27:33.570Z | <Dan Mick> does that theory of the failure sound right to you? |
2024-08-23T18:30:54.379Z | <Dan Mick> (while you ponder that, I'll remove default.realm, since I have a copy, and try restarting 002) |
2024-08-23T18:31:58.915Z | <Dan Mick> well it came up |
2024-08-23T18:32:07.250Z | <Dan Mick> trying 006 |
2024-08-23T18:33:12.082Z | <Dan Mick> it came up too |
2024-08-23T18:33:19.123Z | <Casey Bodley> wonderful |
2024-08-23T18:33:43.343Z | <Dan Mick> cephadm hasn't noticed yet but yeah, good calls |
2024-08-23T18:35:13.015Z | <Dan Mick> trying ceph orch ps --refresh to get it to notice |
2024-08-23T18:36:08.334Z | <Dan Mick> orch ls --refresh, or time, shows it 2/2 now |
2024-08-23T18:36:11.018Z | <Dan Mick> yay |
2024-08-23T18:36:17.989Z | <Casey Bodley> tyvm Dan |
2024-08-23T18:36:35.372Z | <Casey Bodley> what a mess :( |
2024-08-23T18:36:39.625Z | <Dan Mick> I mean ty, couldn't have done anything without you |
2024-08-23T18:37:20.256Z | <Dan Mick> but yeah, do let's drive this to a potential customer impact eval |
2024-08-23T18:37:30.644Z | <Casey Bodley> this multisite config stuff is and always has been full of land mines. this is what we get for trying to clean up/simplify |
2024-08-23T18:37:40.708Z | <Dan Mick> it's complex |
2024-08-23T18:39:52.879Z | <Casey Bodley> it's complicated, but also tries to guess what the admin meant if it doesn't find a valid configuration. makes it really hard to reason about |
2024-08-23T18:43:42.967Z | <Dan Mick> I'm pretty sure Zack set that rgw instance up. I'll ask him if he remembers any confusion about realm/zone/zonegroup |
2024-08-23T18:44:03.892Z | <Dan Mick> but I do remember him telling me "I just followed the instructions for cephadm and it just worked" (in essence) |
2024-08-23T18:44:21.166Z | <Dan Mick> or, you know, cephadm-behind-ceph-orch |
2024-08-23T18:46:18.409Z | <Dan Mick> So the rgw instances behind [quay.ceph.io](http://quay.ceph.io) are back up, and I believe quay is probably working again. Partial explanation: some preexisting configuration (perhaps questionable) was confusing a new startup behavior in radosgw, and radosgw concluded that it could not start. The config was stored in a rados object; @Casey Bodley walked me through where and why, I backed up all the objects in the .rgw.root pool for the postmortem, removed the default.realm object, and the two gateways are now running without apparent problem. |
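Condensed, the recovery sequence described above looks roughly like this; the restart commands are assumptions, since the thread doesn't show exactly how the gateways were restarted:
```
rados -p .rgw.root rm default.realm                 # only after backing up .rgw.root, as above
ceph orch daemon restart rgw.quay.reesi002.anxbdb
ceph orch daemon restart rgw.quay.reesi006.zmjsox
ceph orch ps --refresh                              # nudge cephadm to re-check daemon status
```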
2024-08-23T19:20:50.276Z | <Casey Bodley> opened <https://tracker.ceph.com/issues/67697> and <https://github.com/ceph/ceph/pull/59422> |
2024-08-23T19:43:09.374Z | <Æmerson> Thank you for all the work from both of you. |
2024-08-23T21:08:42.118Z | <Dan Mick> fwiw, I reviewed the 19.1.0 and 19.1.1 release emails and didn't see anything that would have made me be cautious |
2024-08-23T21:11:32.892Z | <Dan Mick> also checked with zack; he has no memory of specifying anything about realm or zone |
2024-08-23T21:11:57.338Z | <Dan Mick> so I suspect we got at least some of this courtesy of cephadm. I don't know if there's a reasonable way to be sure |