2024-07-25T16:50:01.768Z | <gregsfortytwo> CephFS uses the OSDs as a normal client so doesn’t have that kind of expertise. In general, storage systems want to fill caches to their limits, so I presume you have some limits configured too high for the amount of real memory if swap is getting used up. Many groups recommend disabling swap for Ceph servers. |
2024-07-25T18:40:06.121Z | <Erich Weiler> Thanks! Do you have any idea what limits I can tune to help this? The caches are increasing at about 1GB/second on the OSD servers. |
2024-07-25T18:49:40.118Z | <Erich Weiler> I’ve tried `vm.vfs_cache_pressure=1000` and `vm.swappiness=1` but it doesn’t seem to help much… |
2024-07-25T18:57:53.712Z | <Patrick Donnelly> RFR: <https://github.com/ceph/ceph/pull/58861> |
2024-07-25T20:24:08.773Z | <Mark Nelson (nhm)> @Patrick Donnelly Excellent news, the first 10 seconds of testing are looking really good! |
2024-07-25T20:24:47.884Z | <Mark Nelson (nhm)> I'll do a full run to make sure we don't see anything unexpected, but I think you cracked it. |
2024-07-25T20:46:54.043Z | <Mark Nelson (nhm)> hrm, I may have made IO stall. Saw this on one of the MDS logs: |
2024-07-25T20:46:58.692Z | <Mark Nelson (nhm)> ```2024-07-25T20:22:20.512+0000 7fdca33fe700 3 quiesce.mds.8 <quiesce_dispatch> error (-116) submitting q-db[v:(45:0) sets:0/0] from 4365``` |
2024-07-25T20:47:59.988Z | <Mark Nelson (nhm)> seeing those on a couple of other mdses as well. |
2024-07-25T20:48:06.131Z | <Mark Nelson (nhm)> rank 0 shows: |
2024-07-25T20:48:24.400Z | <Mark Nelson (nhm)> ```2024-07-25T20:19:20.975+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-07-25T20:22:15.981+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=5 was never sent ping request.
2024-07-25T20:22:20.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=10 was never sent ping request.
2024-07-25T20:22:25.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=15 was never sent ping request.
2024-07-25T20:22:30.982+0000 7f0be98f2700 -1 mds.pinger is_rank_lagging: rank=20 was never sent ping request.
2024-07-25T20:39:51.666+0000 7f0bee0fb700 0 --1- [v2:172.21.67.18:6860/445211041,v1:172.21.67.18:6861/445211041] >> v1:172.21.67.16:6855/1197928650 conn(0x564a1b11ac00 0x564a1bb64000 :6861 s=OPENED pgs=14 cs=1 l=0).fault initiating reconnect
2024-07-25T20:39:56.198+0000 7f0bef8fe700 0 --1- [v2:172.21.67.18:6860/445211041,v1:172.21.67.18:6861/445211041] >> v1:172.21.67.17:6859/3016582852 conn(0x564a1b858000 0x564a1b81b800 :-1 s=OPENED pgs=17 cs=1 l=0).fault initiating reconnect``` |