ceph - cephfs - 2024-11-19

Timestamp (UTC) | Message
2024-11-19T05:30:53.286Z
<Venky Shankar> @Patrick Donnelly PTAL <https://tracker.ceph.com/issues/64677#note-4> (not urgent at all, so feel free to reschedule this nudge).
2024-11-19T13:22:38.031Z
<Patrick Donnelly> responded
2024-11-19T13:29:15.407Z
<Igor Golikov> Running 5 mins late for daily
2024-11-19T13:57:29.976Z
<Markuze> Hey, folks.
Other than messenger tasks hogging the CPU, I see these errors.
It's the kernel client complaining that a CAP_OP_GRANT was received for an inode it can't find.

Has anyone seen something like this? It's a frequent occurrence.
```[Nov19 15:48] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176be op 0, seq 5
[  +1.307288] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176bf op 0, seq 5
[  +1.441070] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c0 op 0, seq 5
[  +0.513919] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c1 op 0, seq 5
[  +0.318858] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c2 op 0, seq 5
[  +0.438481] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c3 op 0, seq 5
[  +0.384798] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c4 op 0, seq 5
[  +0.453058] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c5 op 0, seq 5
[  +0.365346] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c6 op 0, seq 5
[  +0.386730] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c7 op 0, seq 5
[  +0.719176] ceph: [740e6cbd-11f6-4357-89db-d4d4f82d4b61 21446]: from mds0, can't find ino fffffffffffffffe:100000176c8 op 0, seq 5```
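For context, here is a minimal userspace sketch of the code path that emits these lines. It is a simplified model, not the actual fs/ceph/caps.c code: `CEPH_NOSNAP`, `CEPH_CAP_OP_GRANT`, and the `ceph_vino` {ino, snap} pair are real identifiers from the kernel's ceph headers, while the `cached_inos` array and the `find_inode`/`handle_cap_msg` helpers are stand-ins for the real VFS inode-cache lookup.
```c
/*
 * Simplified, userspace-compilable model of the cap-handling path that
 * emits the "can't find ino" warning. The real code looks the inode up
 * in the VFS inode cache; a tiny array stands in for that cache here.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CEPH_NOSNAP       ((uint64_t)(-2))  /* "live" (non-snapshot) inode */
#define CEPH_CAP_OP_GRANT 0                 /* mds -> client grant message */

struct ceph_vino { uint64_t ino; uint64_t snap; };

/* Stand-in for the inode-cache lookup: the client only knows cached inodes. */
static const uint64_t cached_inos[] = { 0x100000176b0, 0x100000176b1 };

static bool find_inode(struct ceph_vino vino)
{
	for (size_t i = 0; i < sizeof(cached_inos) / sizeof(cached_inos[0]); i++)
		if (cached_inos[i] == vino.ino)
			return true;
	return false;
}

static void handle_cap_msg(int mds, struct ceph_vino vino, int op, int seq)
{
	if (!find_inode(vino)) {
		/* Mirrors the dmesg lines above: snap:ino, op, seq. */
		printf("ceph: from mds%d, can't find ino %llx:%llx op %d, seq %d\n",
		       mds, (unsigned long long)vino.snap,
		       (unsigned long long)vino.ino, op, seq);
		return;
	}
	/* ...normal grant processing would update the cached caps here... */
}

int main(void)
{
	struct ceph_vino vino = { .ino = 0x100000176be, .snap = CEPH_NOSNAP };
	handle_cap_msg(0, vino, CEPH_CAP_OP_GRANT, 5);
	return 0;
}
```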
2024-11-19T15:12:49.755Z
<gregsfortytwo> What’s that -2 it’s outputting? Is that the filesystem ID or something?
2024-11-19T15:13:32.101Z
<gregsfortytwo> Looks like it’s counting up somehow, which is odd
2024-11-19T15:13:47.905Z
<Markuze> It's SNAP = CEPH_NOSNAP
2024-11-19T15:15:18.057Z
<Markuze> It prints inode.snap:inode.ino, then op and seq.
The op is always CAP_OP_GRANT = 0
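A quick decode of those two fields, assuming the constants as defined in the kernel's include/linux/ceph/ceph_fs.h (CEPH_NOSNAP is (__u64)(-2), and CEPH_CAP_OP_GRANT is 0):
```c
#include <stdio.h>
#include <stdint.h>

#define CEPH_NOSNAP       ((uint64_t)(-2))
#define CEPH_CAP_OP_GRANT 0

int main(void)
{
	/* First field of the logged "snap:ino" pair. */
	uint64_t snap = 0xfffffffffffffffeULL;
	printf("snap == CEPH_NOSNAP? %s (as signed: %lld)\n",
	       snap == CEPH_NOSNAP ? "yes" : "no", (long long)snap);
	printf("op 0 == CEPH_CAP_OP_GRANT? %s\n",
	       0 == CEPH_CAP_OP_GRANT ? "yes" : "no");
	return 0;
}
```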
2024-11-19T15:16:07.290Z
<gregsfortytwo> So it’s just counting up the inodes and the client doesn’t recognize a huge portion?
2024-11-19T15:16:37.975Z
<Markuze> So it seems.
2024-11-19T15:16:51.623Z
<gregsfortytwo> Is this a reconnect or something? I’d check what the MDS thinks is happening because this is obviously strange
2024-11-19T15:40:07.384Z
<Markuze> I'm not sure why it's happening; I'm investigating.

I found these two issues so far.

<https://tracker.ceph.com/issues/68980>
<https://tracker.ceph.com/issues/68981>

We are also failing all of the xfstests ceph/ tests, but I'm not sure if it's a test issue or a ceph issue.
2024-11-19T15:57:29.699Z
<gregsfortytwo> this is a brand-new behavior, right? Do you have any patches under test, or new stuff merged to testing?
2024-11-19T15:58:38.878Z
<gregsfortytwo> I imagine the logging is part of the reason the tests are slow, if it’s counting up from 10000000000 to 10000017cc8 (that’s 97480 inode lines printed out!)
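The arithmetic checks out, assuming the warnings cover a contiguous inode range:
```c
#include <stdio.h>

int main(void)
{
	/* Inode range gregsfortytwo estimates: 0x10000000000 .. 0x10000017cc8 */
	unsigned long long lines = 0x10000017cc8ULL - 0x10000000000ULL;
	printf("0x17cc8 = %llu warning lines\n", lines); /* prints 97480 */
	return 0;
}
```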
2024-11-19T16:30:03.306Z
<Markuze> It's not printing on every inode, but there are a bunch.
I don't know when the last stable commit was; I'll try to find it. I'll check `for-linus` because that's what we have for the upcoming 9 and 10 downstream releases.
2024-11-19T16:30:34.874Z
<gregsfortytwo> I mean this could be a server issue too, but this is very abrupt for me
2024-11-19T16:31:22.381Z
<Markuze> The tests do eventually succeed.
2024-11-19T16:33:03.276Z
<Markuze> I had BlueStore overflow and warnings, and one MDS started falling behind on trimming. I don't know; it doesn't look like a CPU or memory issue.
2024-11-19T17:40:11.722Z
<Markuze> @gregsfortytwo, running now on the `for-linus` branch: I still see CPU-hogging warnings, but they seem benign, and I don't see any of the missing-inode errors.
I'll let it run; it takes a while.
There was a crash on a cryptfs test on the testing branch; I want to see if that happens again.
