ceph - cephfs - 2024-10-23

Timestamp (UTC) | Message
2024-10-23T10:48:04.079Z
<Dhairya Parmar> hey @Venky Shankar re: the purge queue tracker and BZ (<https://tracker.ceph.com/issues/68571>) - you mentioned discussing this on slack.
2024-10-23T10:48:17.326Z
<Venky Shankar> yeh
2024-10-23T10:48:24.475Z
<Dhairya Parmar> (it's about adding docs for the pq perf counters)
2024-10-23T10:48:32.863Z
<Venky Shankar> I saw your update.
2024-10-23T10:48:36.208Z
<Venky Shankar> Will have a look soon.
2024-10-23T10:51:31.672Z
<Dhairya Parmar> Alright, do let me know when you'd be available to discuss documenting them
2024-10-23T10:51:48.152Z
<Venky Shankar> Sure, I'll go through the update first.
2024-10-23T11:58:51.129Z
<Dhairya Parmar> RFR for:
<https://github.com/ceph/ceph/pull/58564>
<https://github.com/ceph/ceph/pull/58481>
2024-10-23T12:00:48.798Z
<Dhairya Parmar> There is this quiesce PR <https://github.com/ceph/ceph/pull/58281> I had raised a few months back. Would be good to have some reviews on it too.
2024-10-23T14:40:31.647Z
<Thomas> Hi, it has been suggested that cross posting here may be more fruitful.

Essentially I am trying to recover ceph fs after a total mon failure. The rest of the ceph cluster is healthy. The ceph fs has multiple data pools because of erasure coding, but I can’t seem to add more than one data pool. 

[https://ceph-storage.slack.com/archives/C1HFU4JK1/p1729684013269279](https://ceph-storage.slack.com/archives/C1HFU4JK1/p1729684013269279)
2024-10-23T14:48:45.774Z
<gregsfortytwo> Looking at your ticket, I think you just need to use the pool commands to remove the metadata tagging the pool as belonging to the filesystem, then run the command to add it again
2024-10-23T14:49:23.598Z
<Thomas> I hope it’s that simple! 
2024-10-23T14:49:24.586Z
<gregsfortytwo> Since you only (?) lost the monitors, this is all just map manipulation and the FS itself should be in fine shape
2024-10-23T14:49:31.667Z
<Thomas> Which commands specifically? 
2024-10-23T14:50:34.523Z
<Thomas> To be clear, everything but the OSDs was lost. So I had to create a new cluster and restore into it from the OSDs 
2024-10-23T14:52:33.999Z
<gregsfortytwo> I don’t remember offhand, but there’s a way to manipulate pool metadata. Probably something like
> ceph osd pool <foo> metadata [ls|rm|set]
?
This metadata is used by the cluster to tag pools for particular applications, and is why the command to add it to the FS is failing — it’s identifying the pool as already allocated to an FS, but doesn’t have any detection for this case. We need to update the docs for that or add some intelligence to notice this scenario
2024-10-23T15:02:32.326Z
<Thomas> I can try this 
2024-10-23T15:25:17.789Z
<Thomas> I don't see any commands for pool metadata
2024-10-23T15:43:38.933Z
<gregsfortytwo> Oh, it’s “pool application” [https://docs.ceph.com/en/reef/rados/operations/pools/?highlight=pool+metadata#associating-a-pool-with-an-application](https://docs.ceph.com/en/reef/rados/operations/pools/?highlight=pool+metadata#associating-a-pool-with-an-application) and [https://docs.ceph.com/en/reef/man/8/ceph/?highlight=pool+application](https://docs.ceph.com/en/reef/man/8/ceph/?highlight=pool+application)
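(For reference, the command forms from the linked docs are roughly the following; `<pool>` and `<fs_name>` are placeholders, and exact output may differ per release:)
```
# Show which application(s) a pool is tagged with (JSON output).
ceph osd pool application get <pool>

# Remove a key under an application tag, e.g. a stale cephfs "data" key.
ceph osd pool application rm <pool> cephfs data

# Re-add the pool to the filesystem once the stale tag is gone.
ceph fs add_data_pool <fs_name> <pool>
```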
2024-10-23T15:48:01.896Z
<Thomas> ```bash-4.4$ ceph osd pool application get main-nvme-ec 
{
    "cephfs": {
        "data": "main"
    }
}
bash-4.4$ ceph osd pool application rm main-nvme-ec cephfs data
removed application 'cephfs' key 'data' on pool 'main-nvme-ec'
bash-4.4$ ceph fs add_data_pool main main-nvme-ec
  Pool 'main-nvme-ec' (id '41') has pg autoscale mode 'on' but is not marked as bulk.
  Consider setting the flag by running
    # ceph osd pool set main-nvme-ec bulk true
added data pool 41 to fsmap
bash-4.4$ ceph fs set main joinable true
main marked joinable; MDS may join as newly active.
bash-4.4$ ceph fs status
main - 1 clients
====
RANK      STATE        MDS       ACTIVITY     DNS    INOS   DIRS   CAPS  
 0        active      main-b  Reqs:    0 /s     0      0      0      0   
0-s   standby-replay  main-a  Evts:    0 /s     0      0      0      0   
     POOL        TYPE     USED  AVAIL  
main-metadata  metadata   765M  3811G  
 main-default    data    24.0k  3811G  
 main-nvme-ec    data    6684G  7622G  
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
bash-4.4$ ceph fs subvolume ls main
[]```
😕
2024-10-23T15:48:46.807Z
<Thomas> should I delete the ceph fs and recreate it with `--force --recover` again? Is there a safe way to do that?
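(For context, the recreate path Thomas is referring to is roughly the sequence below, per the CephFS disaster-recovery docs; it was not needed in the end, and the names/flags here are only illustrative:)
```
# Recreate the filesystem entry on top of the existing pools without wiping them.
# --recover keeps rank 0 marked as failed so no MDS activates until the maps look sane.
ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover

# Once satisfied, allow MDS daemons to join.
ceph fs set <fs_name> joinable true
```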
2024-10-23T15:52:40.996Z
<Thomas> Wait, I was looking in the wrong place
2024-10-23T15:55:09.679Z
<Thomas> I don't know how to view this from the cli: https://files.slack.com/files-pri/T1HG3J90S-F07TKCH3CPK/download/image.png
2024-10-23T15:56:25.762Z
<Thomas> I am shocked, it worked
2024-10-23T15:56:30.095Z
<Thomas> Thank you so much
2024-10-23T15:56:47.037Z
<Thomas> I will update the ceph tracker ticket to reflect this
2024-10-23T16:00:40.344Z
<Thomas> <https://tracker.ceph.com/issues/68682>
2024-10-23T16:18:59.955Z
<gregsfortytwo> I have no idea how the mgr module would handle this sequence so you may need to restart the mgr for “subvolume ls” to work 🤷‍♂️ 
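(A couple of ways to bounce the mgr, for anyone following along; the `orch` form assumes a cephadm-managed cluster:)
```
# Fail over to a standby mgr (the usual way to "restart" the active one).
ceph mgr fail

# On a cephadm deployment, restart the mgr daemons outright.
ceph orch restart mgr
```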
Glad it seems to be up now though 👍 
2024-10-23T16:52:27.355Z
<Thomas> Really appreciate your guidance 🙂 Thank you so much
2024-10-23T16:52:35.351Z
<Thomas> I really thought I was going to have to delete it
2024-10-23T16:52:43.880Z
<Thomas> It's been a very valuable learning experience
2024-10-23T17:13:35.772Z
<Thomas> Although, there should be subvolumes right?

```bash-4.4$ ceph fs subvolume ls main
[]```
2024-10-23T17:13:46.010Z
<Thomas> I restarted the mgrs and it's still empty
2024-10-23T17:52:12.625Z
<gregsfortytwo> Yeah I thought this was just doing a directory “ls”, but maybe there’s more to it than that. @Neeraj Pratap Singh @Kotresh H R any ideas?
2024-10-23T22:02:05.915Z
<Ivveh> check your subvolume groups
2024-10-23T22:02:28.639Z
<Ivveh> when you list subvolumes you need to specify the group if they are in one, otherwise you will only list the default `_nogroup` group
2024-10-23T22:03:16.822Z
<Ivveh> ie. `ceph fs subvolume ls <fsname> <subvolume group>`
2024-10-23T22:03:42.639Z
<Ivveh> and since you seem to be using csi and k8s
2024-10-23T22:03:55.634Z
<Ivveh> it should be `csi` if you didn't modify it
2024-10-23T22:04:24.104Z
<Ivveh> ie. `ceph fs subvolume ls main csi`
2024-10-23T22:05:27.093Z
<Ivveh> ah, just looked at the pic, can confirm it's `csi` 🙂
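(Putting Ivveh's hints together, a quick sketch of the check; `csi` is the default group name used by ceph-csi, adjust if yours differs:)
```
# List the subvolume groups in the filesystem.
ceph fs subvolumegroup ls main

# List subvolumes inside a specific group.
ceph fs subvolume ls main csi

# Without a group argument, only the default _nogroup group is listed.
ceph fs subvolume ls main
```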
2024-10-23T22:06:26.050Z
<Ivveh> just out of curiosity, how does one lose all the monitors?
2024-10-23T22:25:56.021Z
<Thomas> Thanks! I will try that command later
2024-10-23T22:28:50.956Z
<Thomas> Single large node, and a consumer WD SSD for the OS which spontaneously died. I really didn't expect an SN 850 to just die? I've replaced it with a PM983 and will take measures to either backup the disk or add two small nodes to the cluster simply for the purpose of keeping the mons (and k8s control plane) alive.
2024-10-23T22:39:44.117Z
<Ivveh> 🪺
