2024-09-17T01:35:33.703Z | <jcollin> @Venky Shankar I was checking out this <https://tracker.ceph.com/issues/68001>. This is squid-branch only, and it's not an issue with failover (which is good). There's an issue with the fs mount there, which prevents getting metrics. Need to check that. |
2024-09-17T04:22:59.211Z | <Venky Shankar> kk |
2024-09-17T04:23:30.437Z | <Venky Shankar> when you say issue with fs mount, do you mean the client isn't forwarding metrics? |
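One way to check whether client metrics are making it to the cluster (a sketch, assuming the `stats` mgr module is enabled; `<daemon-name>` is a placeholder for the active MDS):
```
# Client-side performance metrics aggregated by the stats mgr module; missing or
# zeroed entries for a mount suggest the client isn't forwarding metrics.
ceph fs perf stats

# Check that the client session is registered on the MDS at all.
ceph tell mds.<daemon-name> session ls
```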
2024-09-17T07:08:47.920Z | <Venky Shankar> @Rishabh Dave have you seen this
``` "detail": [
{
"message": "Module 'dashboard' has failed dependency: No module named 'dashboard.services.service'"
},
{
"message": "Module 'volumes' has failed dependency: No module named 'volumes.fs.stats_util'"
}
]``` |
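For mgr module dependency failures like the above, the usual quick checks are (a sketch):
```
# Health detail reports failed modules and why; the "failed dependency"
# messages above show up here as health check detail.
ceph health detail

# Shows which mgr modules are enabled/disabled/always-on.
ceph mgr module ls
```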
2024-09-17T07:49:16.034Z | <Venky Shankar> nvm |
2024-09-17T07:49:30.531Z | <Venky Shankar> Pulled in the latest and it seems it's fixed. |
2024-09-17T09:46:44.862Z | <Dhairya Parmar> > When the request-that-requires-no-Fb comes from a different client, the MDS must revoke the Fb cap from whoever holds it prior to answering the request
That means if a similar request comes from the same client, the MDS won't revoke it? If the MDS can do the same for this single client, keeping the caps in a (caps) wait list while quiesce is underway, IMO this should just work like "request from different client", no? |
2024-09-17T10:33:34.031Z | <Igor Golikov> Hi, finally I am able to run the debug env script on the vossi02 machine, and I have the core dump. So the next question is: the core dump says it was produced by `ceph-mds`, and of course I don't have the executable. So what is the common way of getting the executable? Should I build it myself (from the branch that failed), or is it somewhere in the teuthology archive for the failed test? |
2024-09-17T11:19:36.146Z | <Adam D> What do the following messages mean? I've had them for a while on all my kclients:
```[Tue Sep 17 11:10:22 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190881]: mds0 hung
[Tue Sep 17 11:10:24 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380145058]: mds0 came back
[Tue Sep 17 11:10:24 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190881]: mds0 came back
[Tue Sep 17 11:10:29 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380178140]: mds0 hung
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380178140]: mds0 came back
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190902]: mds0 hung
[Tue Sep 17 11:10:33 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196654]: mds0 hung
[Tue Sep 17 11:10:36 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196654]: mds0 came back
[Tue Sep 17 11:10:36 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380190902]: mds0 came back
[Tue Sep 17 11:10:43 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196693]: mds0 hung
[Tue Sep 17 11:10:45 2024] ceph: [6cdd880b-51be-4125-b265-6eabae47d96c 380196693]: mds0 came back```
|
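While these messages are appearing, the kernel client's view of its MDS sessions and in-flight requests can be inspected via ceph's debugfs entries (a sketch, assuming debugfs is mounted at the default location):
```
# In-flight MDS requests for each ceph mount; a growing backlog points at a stalled rank.
cat /sys/kernel/debug/ceph/*/mdsc

# Per-mount MDS session state as seen by the kernel client.
cat /sys/kernel/debug/ceph/*/mds_sessions
```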
2024-09-17T11:23:43.232Z | <jcollin> I'll check further on it. |
2024-09-17T12:21:44.819Z | <Igor Golikov> Hi, I have some urgent errands to run so I will not attend the daily meeting. My update:
I am exploring the admin socket
Learning teuthology
Trying to analyze the core dump from the tracker; right now I lack the corresponding binary |
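For the admin socket part, the usual entry points are (a sketch; `<name>` is a placeholder for the MDS daemon name):
```
# List available admin socket commands on the host running the MDS.
ceph daemon mds.<name> help

# Equivalent, pointing at the socket file explicitly.
ceph --admin-daemon /var/run/ceph/ceph-mds.<name>.asok help
```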
2024-09-17T12:23:25.778Z | <Anoop C S> RFR: <https://github.com/ceph/ceph/pull/59503> |
2024-09-17T12:25:13.266Z | <Venky Shankar> Thanks, @Igor Golikov |
2024-09-17T12:26:37.675Z | <Venky Shankar> ugh, they seem to show up within 3-5 seconds. Basically the client detected that the mds was unresponsive/unreachable and then it became reachable again. |
2024-09-17T12:27:29.743Z | <Venky Shankar> do you see anything on the mds side, in the cluster log, and/or in the cluster status relating to the mds? |
2024-09-17T12:31:53.776Z | <Adam D> with `debug_mds: 1/5` there are not many logs; nothing matches the times of occurrence on the client |
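If more MDS-side detail is needed around those timestamps, the debug level can be raised temporarily at runtime (a sketch; debug 10+ is verbose, so revert once enough has been captured):
```
# Raise MDS debug logging cluster-wide at runtime.
ceph config set mds debug_mds 10/10

# Remove the override again to fall back to the previous setting.
ceph config rm mds debug_mds
```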
2024-09-17T13:35:34.092Z | <Adam D> it looks like the client has a problem with renewcaps |
2024-09-17T13:35:37.685Z | <Adam D> https://files.slack.com/files-pri/T1HG3J90S-F07MS9Z8M8B/download/image.png |
2024-09-17T14:01:13.513Z | <gregsfortytwo> It won’t be in the archive, but all the teuthology binaries are available in shaman (for a few weeks) so you should be able to download it from there |
2024-09-17T14:02:20.408Z | <gregsfortytwo> The teuthology.log file has the package URLs it is using, though you can also construct them yourself from the Ceph version |
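A rough workflow along those lines (a sketch; the grep pattern and paths are illustrative, the real URLs come from the run's teuthology.log):
```
# Find the shaman/chacra-hosted package repo URLs the run installed from.
grep -oE 'https?://[^" ]*(shaman|chacra)[^" ]*' teuthology.log | sort -u | head

# After installing the matching ceph-mds package (and its debuginfo) from that repo,
# open the core against the same binary and grab a backtrace.
gdb /usr/bin/ceph-mds /path/to/core    # then inside gdb: bt
```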
2024-09-17T14:06:32.067Z | <Igor Golikov> yeah now i found it |
2024-09-17T14:29:47.414Z | <Adam D> ```RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active ceph-filesystem-a Reqs: 76 /s 14.7M 13.7M 257k 5945k
0-s standby-replay ceph-filesystem-b Evts: 70 /s 2035k 920k 54.0k 0``` |
2024-09-17T14:29:54.630Z | <Adam D> ceph.conf
```mds max caps per client = 356000
mds min caps per client = 4096
mds recall max decay rate = 2.0
mds cache trim decay rate = 1.0
mds recall max caps = 30000
mds recall max decay threshold = 98304
mds recall global max decay threshold = 196608
mds recall warning threshold = 98304
mds cache trim threshold = 196608``` |
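For what it's worth, the same MDS recall/trim options can also be set at runtime through the config database instead of ceph.conf (a sketch using a few of the values above; option names use underscores there):
```
ceph config set mds mds_max_caps_per_client 356000
ceph config set mds mds_recall_max_caps 30000
ceph config set mds mds_recall_max_decay_rate 2.0
ceph config set mds mds_cache_trim_threshold 196608
```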
2024-09-17T16:01:13.553Z | <Igor Golikov> Hi, do I need to ask explicitly for access to all sepia lab machines? E.g. I can access vossi01 and smithi02 but not vossi02 ... |
2024-09-17T16:01:27.586Z | <Igor Golikov> vossi01 is out of space and I can't do anything there |
2024-09-17T16:29:35.150Z | <gregsfortytwo> the vossi machines get specific roles; we on the cephfs team reside on vossi04 so you should just use that one 🙂 |
2024-09-17T16:30:26.691Z | <gregsfortytwo> if you can’t reach it, we probably haven’t updated the set of users since your sepia access was granted. Adam can help with that |
2024-09-17T18:10:52.069Z | <Adam D> I think I found the reason: the session_autoclose fs parameter was smaller than session_timeout |
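For reference, both values can be checked and adjusted per filesystem (a sketch; `<fsname>` is a placeholder, and the values shown are the upstream defaults, where session_autoclose is expected to be larger than session_timeout):
```
# Inspect the current per-fs values.
ceph fs get <fsname> | egrep 'session_(timeout|autoclose)'

# Adjust if needed (defaults: session_timeout 60s, session_autoclose 300s).
ceph fs set <fsname> session_timeout 60
ceph fs set <fsname> session_autoclose 300
```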