ceph - ceph-devel - 2024-08-20

Timestamp (UTC) | Message
2024-08-20T07:44:44.173Z
<Yuval Lifshitz> c++17 did not help :-(
2024-08-20T11:19:51.368Z
<stachecki.tyler> Still taking a peek this morning, but I think we hit an interesting case where a ceph_assert led to osd_fast_shutdown being executed on many of our OSDs. The catch is that the fast shutdown calls exit(0) and the systemd unit has Restart=on-failure. Hence the OSDs had to be started manually.
2024-08-20T11:21:00.779Z
<stachecki.tyler> Any thoughts about returning a nonzero RC in such cases to convince systemd to try booting the OSDs a few times? Or maybe this was done to avoid the risk of flapping OSDs?
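(If I understand systemd semantics correctly, that matches the documented behaviour: with `Restart=on-failure`, an exit status of 0 counts as a clean stop, so the unit is not restarted. A local drop-in can change that while the exit-code question is sorted out; a minimal sketch, assuming a cephadm-style unit name, with the osd id purely as an example:)
```
# Sketch only: restart the OSD unit even after a clean exit(0).
# Unit name and osd id are examples; adjust for your deployment.
systemctl edit ceph-<fsid>@osd.3.service
# In the drop-in editor, add:
#   [Service]
#   Restart=always
#   RestartSec=10
```
(The trade-off is exactly the flapping concern above: a repeatedly asserting OSD would keep coming back, so bounding it with `StartLimitIntervalSec`/`StartLimitBurst` in the `[Unit]` section is worth considering.)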
2024-08-20T13:34:52.910Z
<Ivveh> Is there any reason for not having nfs-ganesha in a separate container (other than "it's not fully done yet")? It would help a lot to have a variable for the nfs-ganesha docker image, so that you could easily change its version and customize it more.
2024-08-20T13:38:53.618Z
<John Mulligan> Traditionally, the ceph container contains the ganesha packages needed to run the nfs service. I don't think there are plans to split it apart.
2024-08-20T13:40:24.216Z
<John Mulligan> Please file a tracker issue to add a separate variable like `container_image_nfs` or something like that, if you would like this feature added.
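(For reference, cephadm already exposes per-service image overrides for the monitoring stack, so a ganesha override would presumably follow the same pattern; the `container_image_nfs` line below is hypothetical until such a feature exists.)
```
# Existing pattern: monitoring-stack images are overridable today
ceph config set mgr mgr/cephadm/container_image_grafana <custom-grafana-image>
# Hypothetical equivalent for the nfs service, per the tracker suggestion:
# ceph config set mgr mgr/cephadm/container_image_nfs <custom-ganesha-image>
```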
2024-08-20T14:22:51.965Z
<Jan Horacek> hi, anyone on 18.2.x using the new `ceph-exporter` daemon? I found that we are missing a preconfigured logrotate rule, so I added the ceph-exporter daemon to the list of daemons that get a HUP signal during logrotate (containerized deployment, so the logfile is `/var/log/ceph/<fsid>/ceph-client.ceph-exporter.*` and the logrotate config is `/etc/logrotate.d/ceph-<fsid>`). In the end I found that the daemon does not understand the HUP signal, so it is killed, the whole container is killed, and the ceph cluster goes into health warning because of the failed daemon (after a couple of minutes it restarts again, but this is not what one would expect).
2024-08-20T14:23:36.698Z
<Jan Horacek> I even found that the rook project has an issue that looks related to this, but I did not find any corresponding issue on the ceph tracker.
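(For context, a rough sketch of the kind of postrotate stanza being described; the exact cephadm-generated `/etc/logrotate.d/ceph-<fsid>` varies by version, so the daemon list and options here are an approximation, not a verbatim copy. Adding ceph-exporter to the signal list is what triggers the restart described above, since the daemon currently exits on SIGHUP instead of reopening its log.)
```
# /etc/logrotate.d/ceph-<fsid>  (approximate shape, not verbatim)
/var/log/ceph/<fsid>/*.log {
    rotate 7
    daily
    compress
    sharedscripts
    postrotate
        # send SIGHUP so daemons reopen their logs; ceph-exporter
        # currently dies on SIGHUP, so adding it here restarts it
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd radosgw || true
    endscript
    missingok
    notifempty
}
```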
2024-08-20T15:57:40.427Z
<Casey Bodley> hi Jan, are you able to create one at <https://tracker.ceph.com/projects/ceph/issues/new>? if not, i'd be happy to if you'd share a link to the rook issue
2024-08-20T16:22:47.294Z
<Dan Mick> and what are those screenshots from?
2024-08-20T16:23:17.229Z
<Dan Mick> (and how do they answer my three questions?)
2024-08-20T18:05:16.662Z
<Yuval Lifshitz> any idea about `ninja clean` failure?
```
ninja clean
[2/2] Cleaning all built files...
FAILED: clean
/usr/bin/ninja-build  -t clean
Cleaning... ninja: error: remove(/usr/lib64/libcurl.so): Permission denied
0 files.
ninja: build stopped: subcommand failed.
```
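(Not sure of the root cause, but that error suggests a system library path somehow got registered as a build output in the ninja graph, often the result of a stale build dir after a dependency change. A quick way to confirm, and the usual blunt fix:)
```
# Which rule claims the system library as an output?
ninja -t query /usr/lib64/libcurl.so
# If it really is listed as an output, regenerating the build dir is
# usually simpler than untangling the graph:
# rm -rf build && ./do_cmake.sh
```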
2024-08-20T21:48:39.805Z
<Jeroen Roodhart> For a conservative service that a storage platform should be, this is all surprisingly unconservative :-)

Luckily we have a repository mirror, so at least we have a point-in-time option to test the heck out of. So I need to find out whether the clone3 issue persists with current rhel8 podman. If we can get away with that, we might update the containers and buy some time to move to rhel9/alma9.

Thanks for your insights, that is helpful. I suppose these are the risks of staying “upstream”.
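(One way to answer the clone3 question quickly, with the image name as a placeholder: run the newer container image under the current rhel8 podman and see whether a trivial command fails; if it does, relaxing seccomp for a single test run tells you whether syscall filtering is the culprit.)
```
# Does the host's podman/seccomp setup cope with the newer image?
podman run --rm <newer-ceph-image> true || echo "failed - possibly clone3/seccomp"
# Test-only check (disables syscall filtering, don't use in production):
# podman run --rm --security-opt seccomp=unconfined <newer-ceph-image> true
```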
2024-08-20T23:47:23.595Z
<gregsfortytwo> Is a fresh build still broken due to nvme grpc submodules? @Samuel Just mentioned this to me recently and I’m getting told it’s still an issue. But I would have expected it to get fixed pretty quickly…
2024-08-20T23:48:07.713Z
<gregsfortytwo> e.g.
```
-- Build files have been written to: /home/mbenjamin/dev/ceph-cp/build
[mbenjamin@fedora build]$ ninja -j31
ninja: error: '/home/mbenjamin/dev/ceph-cp/src/nvmeof/gateway/control/proto/gateway.proto', needed by '/home/mbenjamin/dev/ceph-cp/build/src/gateway.pb.cc', missing and no known rule to make it
```
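(If it helps, the missing `gateway.proto` lives under the `src/nvmeof/gateway` submodule path shown in the error, so on an existing checkout the first thing I'd try, as a guess rather than a confirmed fix, is re-syncing submodules and re-running cmake:)
```
# Run from the ceph checkout root (the path in the log above is just an example)
git submodule sync --recursive
git submodule update --init --recursive src/nvmeof/gateway
# then re-run cmake / do_cmake.sh so the gateway.pb.cc rule gets generated
```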