2024-09-11T06:35:00.182Z | <Venky Shankar> @Xiubo Li around? can you see if something comes out of <https://tracker.ceph.com/issues/68002#note-2> |
2024-09-11T07:09:59.684Z | <Xiubo Li> Hi Venky, checking |
2024-09-11T08:12:31.335Z | <Kotresh H R> @Venky Shankar @Milind Changire I have fixed this. Please include in the testing |
2024-09-11T08:17:02.121Z | <Xiubo Li> @Venky Shankar From the call trace it just says that the MDS didn't reply to the cap flush request. |
2024-09-11T08:17:17.934Z | <Xiubo Li> And kept waiting |
2024-09-11T08:17:49.717Z | <Xiubo Li> We need to check what happened in the MDS, maybe it is stuck for some reason |
2024-09-11T08:18:05.963Z | <Venky Shankar> Normally the heartbeat_check messages are seen when there is a kernel lockup or something. |
2024-09-11T08:18:29.938Z | <Xiubo Li> Okay, it should be stuck |
2024-09-11T08:18:35.556Z | <Xiubo Li> Let me have a look |
2024-09-11T08:22:33.877Z | <Xiubo Li> ```Entering kdb (current=0xffff9dbea9f88000, pid 60974) on processor 2 Oops: (null)
due to oops @ 0xffffffffbb68895a
CPU: 2 PID: 60974 Comm: fsstress Kdump: loaded Not tainted 5.14.0-503.el9.x86_64 #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:usercopy_abort+0x74/0x76
Code: 5e 03 bc 51 48 0f 44 d6 49 c7 c3 3c e9 02 bc 4c 89 d1 57 48 c7 c6 10 e6 04 bc 48 c7 c7 b0 e5 04 bc 49 0f 44 f3 e8 68 31 ff ff <0f> 0b 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 2f e6 04 bc e8 73
RSP: 0018:ffffb26d01e67698 EFLAGS: 00010246
RAX: 000000000000006b RBX: 0000000000000112 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9dc5dfca08c0 RDI: ffff9dc5dfca08c0
RBP: 0000000000047000 R08: 0000000000000000 R09: ffffb26d01e674a0
R10: ffffb26d01e67498 R11: ffffffffbcbe93e8 R12: ffff9dbe80042a00
R13: 0000000000000001 R14: ffff9dbfb02d26b0 R15: ffff9dbfd8866712
FS: 00007f61339b7740(0000) GS:ffff9dc5dfc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe0f342f00 CR3: 000000018dbf4002 CR4: 00000000001706f0
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
more>``` |
2024-09-11T08:22:42.832Z | <Xiubo Li> This is another crash in another node |
2024-09-11T08:24:04.335Z | <Xiubo Li> I didn't see the ceph logs |
2024-09-11T08:24:47.280Z | <Xiubo Li> ```[xiubli@vossi04 7891795]$ ls remote/smithi017/coredump/
1725618049.77537.core 1725618049.77538.core 1725618049.77539.core 1725618049.77540.core 1725618049.77541.core 1725618049.77542.core 1725618049.77543.core 1725618049.77544.core 1725618049.77545.core 1725618049.77546.core 1725618049.77547.core``` |
2024-09-11T08:24:55.861Z | <Xiubo Li> A lot of daemons crashed in one node |
2024-09-11T08:26:08.380Z | <Xiubo Li> IMO the above kernel crash brought the node down, and then the corresponding mds daemons lost connection |
2024-09-11T08:26:21.993Z | <Xiubo Li> And also the osd daemons ? |
2024-09-11T10:05:34.643Z | <Dhairya Parmar> @Venky Shankar can you help me a bit with identifying the caps ? |
2024-09-11T10:06:49.477Z | <Venky Shankar> well, yeh. maybe tomorrow. I'm swamped today. |
2024-09-11T10:07:00.435Z | <Dhairya Parmar> 17:15:13 - pAsLsXsFscr
17:15:18 - pAsLsXsFscrb
17:15:43 - pAsxLsXsxFsxcrwb
Does it mean the inode had the read caps throughout the execution |
2024-09-11T10:07:13.849Z | <Venky Shankar> yes, it does look so. |
2024-09-11T10:07:15.660Z | <Dhairya Parmar> i see `r` in `Fscrb` |
2024-09-11T10:07:28.843Z | <Venky Shankar> just ensure you are seeing the correct place, i.e., issued and implemented |
2024-09-11T10:10:02.370Z | <Dhairya Parmar> ```2024-09-10T17:15:18.564+0530 7fce1cc006c0 10 client.4286 cap mds.0 issued pAsLsXsFscrb implemented pAsLsXsFscrb revoking -```
```2024-09-10T17:15:43.854+0530 7fce1cc006c0 10 client.4286 cap mds.0 issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -``` |
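The issued/implemented check on the two log lines above can be scripted. This is a minimal Python sketch, assuming the client debug log keeps the exact field layout shown in those two lines (timestamp, thread id, debug level, client, `cap mds.N issued ... implemented ... revoking ...`). The names `CAP_LINE` and `parse_cap_line` are hypothetical helpers, not part of Ceph.

```python
import re

# Assumed layout, taken only from the two log lines quoted above:
# <timestamp> <thread> <level> client.<id> cap mds.<rank> issued X implemented Y revoking Z
CAP_LINE = re.compile(
    r"(?P<ts>\S+)\s+(?P<thread>\S+)\s+(?P<level>\d+)\s+"
    r"(?P<client>client\.\d+)\s+cap\s+(?P<mds>mds\.\d+)\s+"
    r"issued\s+(?P<issued>\S+)\s+"
    r"implemented\s+(?P<implemented>\S+)\s+"
    r"revoking\s+(?P<revoking>\S+)"
)


def parse_cap_line(line):
    """Return the cap fields of one log line as a dict, or None if it does not match."""
    match = CAP_LINE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2024-09-10T17:15:18.564+0530 7fce1cc006c0 10 client.4286 cap mds.0 "
        "issued pAsLsXsFscrb implemented pAsLsXsFscrb revoking -"
    )
    fields = parse_cap_line(sample)
    # Caps are fully in effect when issued and implemented agree and nothing is being revoked.
    print(fields["issued"] == fields["implemented"], fields["revoking"])
```

Running this over a whole client log and filtering on one inode would show how the caps evolved across the quiesce window, provided the log format matches the assumption above.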
2024-09-11T10:10:22.839Z | <Dhairya Parmar> so indeed they were implemented |
2024-09-11T10:10:34.649Z | <Venky Shankar> kk |
2024-09-11T10:10:42.783Z | <Venky Shankar> so the clients indeed had the caps |
2024-09-11T10:10:50.386Z | <Dhairya Parmar> whats `b` in caps? buffer? |
2024-09-11T10:11:05.787Z | <Venky Shankar> so as we spoke yesterday, check if there is a thread that's holding client_lock for the entire quiesce cycle. |
2024-09-11T10:11:15.005Z | <Dhairya Parmar> yea |
2024-09-11T10:11:16.220Z | <Venky Shankar> > whats `b` in caps? buffer?
yes |
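Cap strings such as `pAsLsXsFscrb` can be unpacked mechanically, which makes it easy to see which file bits (`r`, `b`, ...) are held at each timestamp. This is a minimal Python sketch, assuming the usual CephFS letter convention: `p` is pin, uppercase `A`/`L`/`X`/`F` select the auth/link/xattr/file group, and lowercase letters are bits within that group. The bit names in `BITS` are an assumption beyond what the chat confirms (only `r` = read and `b` = buffer are confirmed here), and `decode_caps` / `has_file_cap` are hypothetical helper names, not Ceph APIs.

```python
# Assumed letter meanings; only "r" (read) and "b" (buffer) are confirmed in the chat above.
GROUPS = {"A": "auth", "L": "link", "X": "xattr", "F": "file"}
BITS = {
    "s": "shared", "x": "excl", "c": "cache", "r": "read",
    "w": "write", "b": "buffer", "a": "append", "l": "lazyio",
}


def decode_caps(caps):
    """Decode a cap string like 'pAsLsXsFscrb' into {group: set of bit names}."""
    result = {}
    if caps == "-":  # "-" means no caps, as in "revoking -"
        return result
    group = None
    for ch in caps:
        if ch == "p":
            result["pin"] = set()
        elif ch in GROUPS:
            group = GROUPS[ch]
            result[group] = set()
        elif ch in BITS and group is not None:
            result[group].add(BITS[ch])
        else:
            raise ValueError(f"unexpected character {ch!r} in cap string {caps!r}")
    return result


def has_file_cap(caps, bit):
    """True if the file group of the cap string contains the named bit."""
    return bit in decode_caps(caps).get("file", set())


if __name__ == "__main__":
    for caps in ("pAsLsXsFscr", "pAsLsXsFscrb", "pAsxLsXsxFsxcrwb"):
        print(caps, sorted(decode_caps(caps)["file"]))
```

Applied to the three strings from the discussion, `has_file_cap(caps, "read")` holds for all of them, while `"buffer"` only appears from the second string onward.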
2024-09-11T10:11:54.281Z | <Dhairya Parmar> > so as we spoke yesterday, check if there is a thread that's holding client_lock for the entire quiesce cycle.
i was just confirming that the intuition is based on concrete evidence |
2024-09-11T10:12:15.405Z | <Dhairya Parmar> > yes
okay |
2024-09-11T10:13:57.055Z | <Dhairya Parmar> 17:15:13 is the time when quiesce was called (caps were `pAsLsXsFscr`), at 17:15:18 the quiesce was ongoing, and `b` was acquired (`pAsLsXsFscrb`). Why does quiesce need it 🤔 |
2024-09-11T11:57:03.792Z | <Dhairya Parmar> @Venky Shankar where can i find more info on iversion lock and inest lock? |
2024-09-11T11:57:34.705Z | <Dhairya Parmar> nevermind, `ceph/src/mds/SimpleLock.h` |
2024-09-11T17:33:27.087Z | <gregsfortytwo> hey @Patrick Donnelly I was just talking about quiesce with @Dhairya Parmar and it got me to thinking — how do we make sure a client can always flush out its dirty snapshot data? I guess this is actually not a new problem to quiesce but I don’t know the answer |
2024-09-11T17:36:17.203Z | <gregsfortytwo> like if client.a has dirty data buffered for a snapshot on file X; it has to write that data out to RADOS before client.b writes to the same objects with the new snapid |
2024-09-11T17:37:50.240Z | <gregsfortytwo> and I must have asked this question before but I sure don’t remember the answer |
2024-09-11T20:37:46.223Z | <Patrick Donnelly> I believe it's not an issue. client.a would have Fbc when quiesced. Snapshot is taken (creating dirty snapshot data), then quiesce released. For client.b to get Fw it would need to wait for client.a to receive Fw so it can flush Fb. |
2024-09-11T20:38:03.790Z | <Patrick Donnelly> Suggest @Dhairya Parmar actually try it to see what happens. |
2024-09-11T20:38:10.889Z | <Patrick Donnelly> should be easy to create a synthetic test. |
2024-09-11T20:42:37.085Z | <gregsfortytwo> You could have Fwb without Fc, but yes of course, as long as they have Fb nobody else can get Fw. 🫢 (And I just checked that because I was confused about it earlier today) |