ceph - cephfs - 2024-09-17

Timestamp (UTC) | Message
2024-09-17T01:35:33.703Z
<jcollin> @Venky Shankar I was checking out this <https://tracker.ceph.com/issues/68001>. This is squid-branch only, and it's not an issue with failover (which is good). There's an issue with the fs mount there, which prevents getting metrics. Need to check that.
2024-09-17T04:22:59.211Z
<Venky Shankar> kk
2024-09-17T04:23:30.437Z
<Venky Shankar> when you say issue with fs mount, do you mean the client isn't forwarding metrics?
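For anyone following along: a hedged sketch of how one might check whether clients are forwarding metrics, assuming the mgr `stats` module is enabled. The MDS daemon name below is a placeholder.

```shell
# Per-client performance metrics aggregated from client reports
# (requires the mgr "stats" module: ceph mgr module enable stats)
ceph fs perf stats

# If a client shows no metrics, inspect its session on the MDS;
# mds.<name> is a placeholder for the actual daemon name
ceph tell mds.<name> session ls
```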
2024-09-17T07:08:47.920Z
<Venky Shankar> @Rishabh Dave have you seen this
```        "detail": [
            {
                "message": "Module 'dashboard' has failed dependency: No module named 'dashboard.services.service'"
            },
            {
                "message": "Module 'volumes' has failed dependency: No module named 'volumes.fs.stats_util'"
            }
        ]```
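When mgr modules fail to load with dependency errors like the above, a few generic checks (a sketch; none of these is specific to this failure):

```shell
# List mgr modules and any recorded error states
ceph mgr module ls

# Health detail usually repeats the module failure message
ceph health detail

# Failing over to a standby mgr forces modules to be reloaded
ceph mgr fail
```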
2024-09-17T07:49:16.034Z
<Venky Shankar> nvm
2024-09-17T07:49:30.531Z
<Venky Shankar> Pulled in the latest and it seems it's fixed.
2024-09-17T09:46:44.862Z
<Dhairya Parmar> > When the request-that-requires-no-Fb comes from a different client, the MDS must revoke the Fb cap from whoever holds it prior to answering the request
That means if a similar request comes from the same client, the MDS won't revoke it? If the MDS can do the same for this single client, keeping the caps in a (caps) wait list while the quiesce is underway, IMO this should just work like a "request from a different client", no?
2024-09-17T10:33:34.031Z
<Igor Golikov> Hi, finally I am able to run the debug env script on the vossi02 machine, and I have the core dump. So the next question is: the core dump says it was produced by `ceph-mds`, and of course I don't have the executable. What is the common way of getting the executable? Should I build it myself (from the branch that failed), or is it somewhere in the teuthology archive for the failed test?
2024-09-17T11:19:36.146Z
<Adam D> What do the following messages mean? I've had them for a while on all my kclients:
```[Tue Sep 17 11:10:22 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190881]: mds0 hung
[Tue Sep 17 11:10:24 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380145058]: mds0 came back
[Tue Sep 17 11:10:24 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190881]: mds0 came back
[Tue Sep 17 11:10:29 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380178140]: mds0 hung
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380178140]: mds0 came back
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190902]: mds0 hung
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196654]: mds0 hung
[Tue Sep 17 11:10:36 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196654]: mds0 came back
[Tue Sep 17 11:10:36 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190902]: mds0 came back
[Tue Sep 17 11:10:43 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196693]: mds0 hung
[Tue Sep 17 11:10:45 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196693]: mds0 came back```
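A hedged sketch of things one might check when the kernel client logs `mds0 hung` / `mds0 came back`, assuming debugfs is mounted on the client (the wildcard stands in for the fsid.client-id directory name):

```shell
# MDS state as the cluster sees it
ceph fs status
ceph health detail

# Kernel-client view of its MDS sessions
cat /sys/kernel/debug/ceph/*/mds_sessions

# In-flight MDS requests that may be stuck
cat /sys/kernel/debug/ceph/*/mdsc
```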
2024-09-17T11:23:43.232Z
<jcollin> I'll check further on it.
2024-09-17T12:21:44.819Z
<Igor Golikov> Hi, I have some urgent errands to run, so I will not attend the daily meeting. My update:
- Exploring the admin socket
- Learning teuthology
- Trying to analyze the core dump from the tracker; right now I lack the corresponding binary
2024-09-17T12:23:25.778Z
<Anoop C S> RFR: <https://github.com/ceph/ceph/pull/59503>
2024-09-17T12:25:13.266Z
<Venky Shankar> Thanks, @Igor Golikov
2024-09-17T12:26:37.675Z
<Venky Shankar> ugh, they seem to show up within 3-5 seconds. Basically the client detected that the mds was unresponsive/unreachable and then it was reachable again.
2024-09-17T12:27:29.743Z
<Venky Shankar> do you see anything on the mds side, or in the cluster log and/or cluster status, relating to the mds?
2024-09-17T12:31:53.776Z
<Adam D> with `debug_mds: 1/5` there are not many logs, and nothing matches the times of occurrence on the client
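If the default level hides the relevant events, the mds debug level can be raised temporarily at runtime (a sketch; `10/20` is just an example level and is very verbose, and the daemon name is a placeholder):

```shell
# Raise mds debug logging via the central config (all MDS daemons)
ceph config set mds debug_mds 10/20
# ...reproduce the issue, then restore the default
ceph config set mds debug_mds 1/5

# Or target a single daemon without touching persistent config
ceph tell mds.<name> config set debug_mds 10/20
```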
2024-09-17T13:35:34.092Z
<Adam D> it looks like the client has a problem with renewcaps
2024-09-17T13:35:37.685Z
<Adam D> https://files.slack.com/files-pri/T1HG3J90S-F07MS9Z8M8B/download/image.png
2024-09-17T14:01:13.513Z
<gregsfortytwo> It won’t be in the archive, but all the teuthology binaries are available in shaman (for a few weeks) so you should be able to download it from there
2024-09-17T14:02:20.408Z
<gregsfortytwo> The teuthology.log file has the package URLs it is using, though you can also construct them yourself from the Ceph version
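A sketch of that workflow, assuming an RPM-based build; the grep pattern, URL, and file names are illustrative placeholders, not exact values:

```shell
# Find the package repo URLs the run actually used
grep -oE 'https?://[^ "]*(shaman|chacra)[^ "]*' teuthology.log | sort -u

# Download the matching ceph-mds package and unpack it locally
# (URL and file name below are placeholders from the grep output)
curl -O <package-url>/ceph-mds-<version>.rpm
rpm2cpio ceph-mds-<version>.rpm | cpio -idm

# Point gdb at the extracted binary plus the core dump
gdb ./usr/bin/ceph-mds core.<pid>
```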
2024-09-17T14:06:32.067Z
<Igor Golikov> yeah, now I found it
2024-09-17T14:29:47.414Z
<Adam D> ```RANK      STATE              MDS            ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      ceph-filesystem-a  Reqs:   76 /s  14.7M  13.7M   257k  5945k
0-s   standby-replay  ceph-filesystem-b  Evts:   70 /s  2035k   920k  54.0k     0```
2024-09-17T14:29:54.630Z
<Adam D> ceph.conf
```mds max caps per client = 356000
    mds min caps per client = 4096
    mds recall max decay rate = 2.0
    mds cache trim decay rate = 1.0
    mds recall max caps = 30000
    mds recall max decay threshold = 98304
    mds recall global max decay threshold = 196608
    mds recall warning threshold = 98304
    mds cache trim threshold = 196608```
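For reference, the same settings can be inspected and adjusted at runtime instead of via ceph.conf (a sketch; the daemon name is a placeholder and the value shown is the one from the paste, not a recommendation):

```shell
# Confirm what the running MDS actually uses
ceph config get mds mds_recall_max_caps
ceph tell mds.<name> config get mds_max_caps_per_client

# Adjust without editing ceph.conf
ceph config set mds mds_recall_max_caps 30000
```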
2024-09-17T16:01:13.553Z
<Igor Golikov> Hi, do I need to ask explicitly for access to all sepia lab machines? E.g. I can access vossi01 and smithi02 but not vossi02 ...
2024-09-17T16:01:27.586Z
<Igor Golikov> vossi01 is out of space and I can't do anything there
2024-09-17T16:29:35.150Z
<gregsfortytwo> the vossi machines get specific roles; we on the cephfs team reside on vossi04 so you should just use that one 🙂
2024-09-17T16:30:26.691Z
<gregsfortytwo> if you can’t reach it, we probably haven’t updated the set of users since your sepia access was granted. Adam can help with that
2024-09-17T18:10:52.069Z
<Adam D> I think I found the reason: the `session_autoclose` fs parameter was smaller than `session_timeout`
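A sketch of how one might verify and fix that relationship (the filesystem name is a placeholder; the values shown are the upstream defaults, where `session_autoclose` at 300s is deliberately larger than `session_timeout` at 60s):

```shell
# Both values appear in the filesystem dump
ceph fs get <fsname> | grep -E 'session_(timeout|autoclose)'

# Ensure autoclose stays larger than the timeout (values are examples)
ceph fs set <fsname> session_autoclose 300
ceph fs set <fsname> session_timeout 60
```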
