ceph - ceph-devel - 2024-11-05

Timestamp (UTC)
Message
2024-11-05T13:33:01.172Z
<Ken Carlile> I want to reiterate the need for a comprehensive guide to how the configuration is set up and how to look at the values with ceph config. I have found wild deviations in syntax (do I use slashes? do I use underscores? Sometimes one, sometimes the other!). There doesn't seem to be a comprehensive look at what might be under various nodes (a la sysctl letting you step between levels just by giving an incomplete key), and for the love of glod, ceph config get _doesn't take global as a "who" but ceph config set does!_ As a new administrator, this is just hellish to go through, especially when looking at documentation and slides that show commands that don't end up working for a completely unknown reason. For example, I'm looking at Dan van der Ster's excellent presentation about upmap (<https://www.youtube.com/watch?v=9lsByOMdEwc&t=1154s>, <https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf>) and I come to slide 19, where it gives the tl;dr (not that I'm going to follow that necessarily), but I want to look at what the settings on my cluster are. Guess what, there doesn't appear to be a mgr/balancer/max_misplaced on my Reef 18.2.4 cluster. There are things that sorta look like that, maybe, if I look at the dashboard configuration options, but...? Also, those ceph config set commands look off, because there's obviously no "who" parameter in them. I have no way of knowing if they would work as written without trying them, because ceph config get complains immediately about the lack of a who.
2024-11-05T15:11:00.115Z
<Ivveh> the presentation is wrong
2024-11-05T15:11:37.290Z
<Ivveh> ceph config [set|rm] <who> <key> [<value>]
2024-11-05T15:12:02.461Z
<Ivveh> in your case `ceph config set mgr mgr/balancer/max_misplaced X`
2024-11-05T15:12:30.014Z
<Ivveh> and if you use rm, dont use value
2024-11-05T15:12:33.982Z
<Ken Carlile> but how do I read that key? I don't want to set it without knowing what it is to start
2024-11-05T15:12:38.184Z
<Ivveh> (this will set it to default)
2024-11-05T15:12:45.872Z
<Ken Carlile> ah
2024-11-05T15:13:03.694Z
<Ivveh> if its set (non-default) you can get all values with `ceph config dump`
2024-11-05T15:13:16.103Z
<Ken Carlile> ```# ceph config get mgr mgr/balancer/max_misplaced
Error ENOENT: unrecognized key 'mgr/balancer/max_misplaced'```
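The "read it before you set it" concern above can be sketched as a small wrapper. This is only an illustration under stated assumptions: `safe_set` is a hypothetical helper name, and it relies on `ceph config get` exiting non-zero for an unrecognized key, as the ENOENT error above suggests.

```shell
# safe_set <who> <key> <value>
# Hypothetical helper: read the current value first, and refuse to write
# if the key cannot be read (for example an unrecognized key).
safe_set() {
    # The assignment inherits the exit status of the command substitution.
    old="$(ceph config get "$1" "$2")" || return 1
    printf 'previous %s %s = %s\n' "$1" "$2" "$old"
    ceph config set "$1" "$2" "$3"
}
```

Called as `safe_set mgr mgr/balancer/max_misplaced X`, this would stop at the failed read on the cluster described here instead of attempting the write.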
2024-11-05T15:13:50.083Z
<Ivveh> then there is get or show-with-defaults too if im not mistaken
2024-11-05T15:14:27.016Z
<Ivveh> key might not exist
2024-11-05T15:15:18.918Z
<Ken Carlile> there doesn't seem to be any option for showing config with defaults for ceph config dump
2024-11-05T15:15:26.410Z
<Ken Carlile> but this is pretty much what I'm getting at
2024-11-05T15:15:49.308Z
<Ivveh> there is, i dont know it off the top of my head.. give me a sec
2024-11-05T15:15:59.374Z
<Ken Carlile> there doesn't appear to be any central place in the documentation or otherwise that one can reference to really get a handle on it all
2024-11-05T15:17:12.876Z
<Ivveh> `ceph config show-with-defaults <who.id>`
2024-11-05T15:17:58.336Z
<Ken Carlile> so it's not under dump
2024-11-05T15:20:42.168Z
<Ivveh> dump will only show configured items
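The difference between the two commands can be sketched as two lookups. The function names are hypothetical, `osd.0` in the usage note is just an example daemon name, and the sketch assumes the option name is the first column of `ceph config show-with-defaults` output.

```shell
# effective_value <daemon> <option>
# Print the matching row for one option of a running daemon, defaults included.
effective_value() {
    ceph config show-with-defaults "$1" | awk -v key="$2" '$1 == key'
}

# explicitly_set <pattern>
# Print only rows for options that were explicitly configured;
# options still at their default never appear in the dump.
explicitly_set() {
    ceph config dump | grep -- "$1"
}
```

For example, `effective_value osd.0 osd_max_backfills` prints a row even when nothing was ever set, while `explicitly_set osd_max_backfills` prints nothing in that case.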
2024-11-05T15:20:51.121Z
<Ken Carlile> mm hm. still proving my point. 😄
2024-11-05T15:20:56.093Z
<Ivveh> yeah i agree with you, the config is a mess
2024-11-05T15:21:25.331Z
<Ken Carlile> I'm not going to try digging around to find it any more; at the moment it's not a primary concern. but thank you!
2024-11-05T15:22:30.559Z
<Ivveh> to check whether the key exists you can do a `ceph config-key exists config/mgr/mgr/balancer/max_misplaced`
2024-11-05T15:23:11.423Z
<Ken Carlile> oh no, this goes back to the config-key thing I keep forgetting about
2024-11-05T15:24:21.680Z
<Ivveh> ceph config sets shit in config-key but with a prefix of config/<who> more or less
2024-11-05T15:25:01.950Z
<Ivveh> you can dump the config-key too
2024-11-05T15:25:10.717Z
<Ivveh> but it wont show you defaults
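The prefix mapping described above ("config/<who>, more or less") can be sketched as follows. Both function names are hypothetical, and the exact key layout is an assumption taken from the one example in this conversation, so it may not hold for every option.

```shell
# config_key_path <who> <key>
# Build the config-key name that `ceph config set <who> <key> ...` is
# assumed to write, following the config/<who>/<key> pattern above.
config_key_path() {
    printf 'config/%s/%s\n' "$1" "$2"
}

# is_explicitly_stored <who> <key>
# Ask the config-key store whether that entry exists. Defaults are not
# stored there, so a miss does not mean the option is invalid.
is_explicitly_stored() {
    ceph config-key exists "$(config_key_path "$1" "$2")"
}
```

`config_key_path mgr mgr/balancer/max_misplaced` yields `config/mgr/mgr/balancer/max_misplaced`, the same name used in the `config-key exists` command above.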
2024-11-05T15:25:29.400Z
<Ken Carlile> oh my head
2024-11-05T15:25:40.015Z
<Ivveh> yes
2024-11-05T15:25:46.292Z
<Ivveh> and soul
2024-11-05T15:28:33.477Z
<Ivveh> it would be great if any of these commands could just show all possible config, but none of them do
2024-11-05T15:29:43.192Z
<Ivveh> the config-key store is kinda the source of truth, but it takes a lot of guesswork, or help from auxiliary commands
2024-11-05T15:37:55.111Z
<Ivveh> `ceph config ls | grep balancer`
2024-11-05T15:38:21.874Z
<Ivveh> these seem to be the available ones that you can set via config, which then sets them in the config-key 😄
2024-11-05T15:39:20.998Z
<Ken Carlile> < insert blinking man gif here >
2024-11-05T15:39:42.674Z
<Ivveh> but then what they become in config-key is unclear
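A rough stand-in for the sysctl-style browsing asked for earlier is to combine `ceph config ls` with `ceph config get`. This is a sketch, not an existing ceph feature: `browse_options` is a hypothetical name, and it assumes `ceph config get <who> <key>` returns the default when nothing is set.

```shell
# browse_options <who> <pattern>
# List every known option whose name matches the pattern and print the
# value that <who> would see for it.
browse_options() {
    ceph config ls | grep -- "$2" | while read -r key; do
        # stdin is redirected so the inner call cannot consume the list.
        printf '%s = %s\n' "$key" "$(ceph config get "$1" "$key" </dev/null)"
    done
}
```

`browse_options mgr balancer` would then print one `name = value` line per balancer-related option.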
2024-11-05T15:43:06.311Z
<Ivveh> hmm, read the rest of that presentation
2024-11-05T15:43:12.907Z
<Ivveh> i would be careful doing that
2024-11-05T15:43:22.714Z
<Ken Carlile> I'm not doing any of it! 😄
2024-11-05T15:43:35.026Z
<Ivveh> "magic script"
2024-11-05T15:43:37.610Z
<Ivveh> made me stop
2024-11-05T15:43:41.613Z
<Ken Carlile> well, I'm trying to use his upmap-remapped.py script, but... it is not working.
2024-11-05T15:44:11.442Z
<Ken Carlile> so I guess I am doing that. 😄
2024-11-05T15:44:27.065Z
<Ivveh> well, if you want health ok and close your eyes better do ceph health mute
2024-11-05T15:44:36.544Z
<Ivveh> for 6w
2024-11-05T15:45:00.503Z
<Ken Carlile> I mostly want the backfill_toofull that don't make sense to go away. since none of my osds are anywhere near full.
2024-11-05T15:45:25.483Z
<Ivveh> solution, figure out what osd is causing the backfill_toofull
2024-11-05T15:45:43.132Z
<Ivveh> you can do that with `ceph pg`
2024-11-05T15:46:29.465Z
<Ivveh> and try setting it `ceph osd crush reweight osd.X 0`
2024-11-05T15:54:03.418Z
<Ken Carlile> I'm having a real hard time locating it to a single OSD.
2024-11-05T15:56:44.372Z
<Ivveh> cant remember if it shows in dump_stuck
2024-11-05T15:56:50.194Z
<Ivveh> `ceph pg dump_stuck`
2024-11-05T15:58:11.610Z
<Ken Carlile> as I understand it, it should show in ceph health detail, but it's not calling out any OSDs in particular; just the list of PGs
2024-11-05T15:59:23.254Z
<Ivveh> you can find it via `ceph pg map <pgid>`
2024-11-05T16:00:02.307Z
<Ivveh> you can also inspect it with `ceph pg` to figure out what is the issue
2024-11-05T16:00:05.486Z
<Eugen Block> I'm not entirely sure at the moment, but IIRC, those PGs would fill up an OSD if they _*were*_ to be back filled to the destination OSDs. It's not saying that OSDs are _*now*_ too full.
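One way to see which destination OSDs are involved is to compare the up and acting sets of the affected PGs. The sketch below uses a hypothetical function name and assumes the `ceph pg dump pgs_brief` column order PG_STAT, STATE, UP, UP_PRIMARY, ACTING, ACTING_PRIMARY. Check the header line on your release before trusting it.

```shell
# toofull_pgs
# Print each PG whose state includes backfill_toofull together with its
# up set (where the data should end up) and acting set (where it is now).
toofull_pgs() {
    ceph pg dump pgs_brief 2>/dev/null |
        awk '$2 ~ /backfill_toofull/ { print $1, "up=" $3, "acting=" $5 }'
}
```

OSDs that appear in the up set but not in the acting set are the backfill targets, and those are the ones worth checking for fullness.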
2024-11-05T16:00:14.312Z
<Ken Carlile> that's kind of what I was thinking.
2024-11-05T16:00:39.280Z
<Ivveh> yes it can be in the queue
2024-11-05T16:01:10.258Z
<Ken Carlile> so I just need to figure out how to kick it out of that state--most particularly on the one pg that is active+undersized+degraded+remapped+backfill_toofull because that's blocking the balancer.
2024-11-05T16:01:31.402Z
<Ivveh> is it not backfilling at all?
2024-11-05T16:01:39.368Z
<Ken Carlile> oh, it's backfilling, just not _that_ one.
2024-11-05T16:01:44.699Z
<Ken Carlile> gotta step away for ~ 1 hr
2024-11-05T16:01:45.301Z
<Ivveh> then no problem
2024-11-05T16:01:50.888Z
<Ivveh> just wait 🙂
2024-11-05T16:01:51.406Z
<Eugen Block> It should eventually resolve
2024-11-05T16:01:56.889Z
<Ken Carlile> oh fine. 😄
2024-11-05T16:02:41.465Z
<Eugen Block> And it doesn't block IO or something if the PG isn't inactive, so just be patient. 😉
2024-11-05T17:01:29.703Z
<Ken Carlile> I have stopped IO to the cluster while this is going on, and that's really what's driving my "hurry up." I suspect that I am being overly cautious in this arena.
