ceph - cephfs - 2024-07-08

Timestamp (UTC) | Message
2024-07-08T02:46:47.735Z
<Xiubo Li> Found the root cause: it's a deadlock between the `lock cache` and `create/unlink`; please see <https://tracker.ceph.com/issues/65607#note-7>.

Need to find a proper way to fix it
2024-07-08T02:48:56.122Z
<Xiubo Li> @Patrick Donnelly @Venky Shankar I want to add an mds option to allow us to disable the `lock cache` feature along with the bug fix. Could you foresee any potential issues with this?
2024-07-08T04:39:15.717Z
<Jos Collin> @Venky Shankar Is the PR ready?
2024-07-08T04:40:20.140Z
<Venky Shankar> no, sorry!
2024-07-08T04:40:27.918Z
<Venky Shankar> I worked on it on Friday
2024-07-08T04:40:36.617Z
<Venky Shankar> but couldn't finish it
2024-07-08T04:40:50.079Z
<Jos Collin> ok
2024-07-08T04:40:58.314Z
<Venky Shankar> Give me a few hours
2024-07-08T04:41:01.765Z
<Venky Shankar> will push a fix
2024-07-08T04:41:57.623Z
<Jos Collin> ok
2024-07-08T05:57:21.460Z
<Jos Collin> @Venky Shankar <https://tracker.ceph.com/issues/64761>, what is the maximum number of sync operations? From reading some of the qa tests, it should be 3?
2024-07-08T06:01:13.877Z
<Jos Collin> That should be a small value.
2024-07-08T06:16:50.490Z
<Venky Shankar> checking
2024-07-08T06:17:20.243Z
<Venky Shankar> there are 4 sync threads I believe
2024-07-08T06:21:20.532Z
<Jos Collin> I was referring to <https://github.com/ceph/ceph/blob/main/qa/tasks/cephfs/test_mirroring.py#L1332>.
2024-07-08T06:21:30.248Z
<Jos Collin> Looks like 3 by default?
2024-07-08T06:21:30.595Z
<Venky Shankar> ah
2024-07-08T06:21:35.795Z
<Venky Shankar> then yes 🙂
2024-07-08T06:21:41.677Z
<Venky Shankar> You can check PeerReplayer.cc
2024-07-08T06:27:02.729Z
<Jos Collin> ok, will do. I think a config option for the number of ops, plus a check that rejects new syncs when the throttle is reached, somewhere in sync_snaps(), would suffice.
2024-07-08T06:27:29.445Z
<Venky Shankar> well, throttler would actually do that
2024-07-08T06:27:40.602Z
<Venky Shankar> so, basically PeerReplayer can inherit throttler
2024-07-08T06:27:55.514Z
<Venky Shankar> no need to code the throttle again
2024-07-08T06:28:00.383Z
<Jos Collin> aah, that's cool
2024-07-08T06:28:01.274Z
<Venky Shankar> check usages of Throttle class
2024-07-08T06:28:04.082Z
<Venky Shankar> yeh +1
2024-07-08T06:28:14.011Z
<Jos Collin> Let me check that then.
2024-07-08T06:28:26.934Z
<Jos Collin> thx
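The throttling approach discussed above (reuse a throttle rather than re-implementing the check in sync_snaps()) might look roughly like the sketch below. This is a minimal illustration of the pattern only, not Ceph's actual Throttle API (see src/common/Throttle.h in the ceph tree for the real class); all names here are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a counting throttle: callers take a slot
// before starting a snapshot sync and release it when done, so at most
// `max` syncs run concurrently.
class SimpleThrottle {
public:
  explicit SimpleThrottle(int max) : max_(max) {}

  // Block until a slot is free, then take it.
  void get() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return count_ < max_; });
    ++count_;
  }

  // Non-blocking variant: reject when the throttle is saturated,
  // matching the "check/reject" idea mentioned for sync_snaps().
  bool try_get() {
    std::lock_guard<std::mutex> l(m_);
    if (count_ >= max_)
      return false;
    ++count_;
    return true;
  }

  // Release a previously taken slot and wake one waiter.
  void put() {
    std::lock_guard<std::mutex> l(m_);
    assert(count_ > 0);
    --count_;
    cv_.notify_one();
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  int max_;
  int count_ = 0;
};
```

With this shape, the number of concurrent syncs becomes a single configurable constructor argument instead of ad-hoc checks scattered through the replayer.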
2024-07-08T08:51:00.638Z
<Jos Collin> @Venky Shankar Today's standup at 4pm India time?
2024-07-08T09:21:04.170Z
<Venky Shankar> yes
2024-07-08T09:21:12.522Z
<Venky Shankar> Most folks in US are on PTO this week.
2024-07-08T12:50:57.982Z
<Dhairya Parmar> @Venky Shankar remember the upstream user who hit <https://tracker.ceph.com/issues/61009> recently, well once they got their fs up and running but unfortunately the filesystem crashed again with
```/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 275: FAILED ceph_assert(!p)

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fc4693334a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7fc469333669]
 3: /usr/bin/ceph-mds(+0x2d2f04) [0x55caed408f04]
 4: (MDCache::cow_inode(CInode*, snapid_t)+0x334) [0x55caed40eeb4]
 5: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x8d2) [0x55caed410792]
 6: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x966) [0x55caed35b596]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x6d5) [0x55caed3ab295]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x13c) [0x55caed3aff6c]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x55caed2f827b]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x55caed2f88bc]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x55caed2e1c2f]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7fc4695aeed8]
 13: (DispatchQueue::entry()+0x50f) [0x7fc4695ac31f]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fc469676381]
 15: /lib64/libpthread.so.0(+0x81ca) [0x7fc4683211ca]
 16: clone()```
2024-07-08T12:51:37.267Z
<Venky Shankar> oh dear
2024-07-08T12:51:39.239Z
<Venky Shankar> that's nasty
2024-07-08T12:51:40.096Z
<Venky Shankar> 😕
2024-07-08T12:51:48.594Z
<Venky Shankar> they are using kclient I believe?
2024-07-08T12:52:14.432Z
<Dhairya Parmar> yeah, and to add to the issues they even turned off `nowsync`  with their kclient mounts
2024-07-08T12:52:44.724Z
<Venky Shankar> nowsync does not affect the `_unlink_local` crash
2024-07-08T12:53:20.735Z
<Dhairya Parmar> okay
2024-07-08T12:53:31.876Z
<Dhairya Parmar> I'm reading the cow_inode code
2024-07-08T12:53:43.599Z
<Dhairya Parmar> and the comment is interesting
```/*
 * note: i'm currently cheating wrt dirty and inode.version on cow
 * items.  instead of doing a full dir predirty, i just take the
 * original item's version, and set the dirty flag (via
 * mutation::add_cow_{inode,dentry}() and mutation::apply().  that
 * means a special case in the dir commit clean sweep assertions.
 * bah.
 */```
2024-07-08T12:53:50.119Z
<Venky Shankar> kclient passed in that bogus value I believe
2024-07-08T12:54:12.795Z
<Dhairya Parmar> while replaying the journal, you mean?
2024-07-08T12:55:02.458Z
<Venky Shankar> nope
2024-07-08T12:55:12.065Z
<Venky Shankar> In a client request and that got recorded in the journal
2024-07-08T12:55:25.698Z
<Dhairya Parmar> oh!
2024-07-08T12:55:27.450Z
<Venky Shankar> see: <https://tracker.ceph.com/issues/54546#note-7>
2024-07-08T12:55:29.670Z
<Dhairya Parmar> oh!
2024-07-08T12:58:14Z
<Dhairya Parmar> the assertion failures are different
2024-07-08T12:59:04.204Z
<Venky Shankar> kk
2024-07-08T12:59:09.301Z
<Venky Shankar> _unlink_local gave it away
2024-07-08T12:59:15.319Z
<Venky Shankar> I believe, this could be kind of related
2024-07-08T12:59:34.055Z
<Venky Shankar> btw, did the user run scrub post recovering from the interval_set failure?
2024-07-08T13:00:10.002Z
<Dhairya Parmar> yeah, they ran it; here's what they did

```We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again.```
2024-07-08T13:00:27.288Z
<Dhairya Parmar> so post scrub they did a repaid scrub, does that ring any bells?
2024-07-08T13:01:16.076Z
<Venky Shankar> that's expected
2024-07-08T13:01:21.242Z
<Venky Shankar> some repair might be normal
2024-07-08T13:01:36.298Z
<Dhairya Parmar> okay
2024-07-08T13:01:49.767Z
<Venky Shankar> I'll have to check what could have happened
2024-07-08T13:02:31.445Z
<Dhairya Parmar> https://files.slack.com/files-pri/T1HG3J90S-F07BTTXH3NV/download/ceph-mds.pebbles-s2.log-20240708.gz
2024-07-08T13:03:49.307Z
<Dhairya Parmar> this is the mds log that contains that assertion failure. once they hit this, they reset the journal again (basically following the same steps I sent them), but now the fs is in the failed state
2024-07-08T13:04:03.057Z
<Dhairya Parmar> they've shared logs. I'm downloading them currently
2024-07-08T13:07:43.409Z
<Venky Shankar> I'm not familiar with this backtrace, so resetting the journal might not help.
2024-07-08T13:07:54.116Z
<Venky Shankar> My guess is the fs is in a pretty bad state 😕
2024-07-08T13:08:05.294Z
<Dhairya Parmar> yikes
2024-07-08T13:10:32.807Z
<Dhairya Parmar> done downloading the log
2024-07-08T13:10:39.836Z
<Dhairya Parmar> it is filled with this line
```2024-07-08T10:19:19.935+0100 7f5201d08700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x1005e1773e4 [565,head] v3196 at /cephfs2-users/vchang/20240701_VC143-2_GluA3ATD-NB65/Movies/Images-Disc1/GridSquare_896930/Data/FoilHole_2014315_Data_918775_16_20240702_111309_Fractions.tiff.vifeyek0.partial, but inode 0x1005e1773e4.head v3198 already exists at /cephfs2-users/vchang/20240701_VC143-2_GluA3ATD-NB65/Movies/Images-Disc1/GridSquare_896930/Data/FoilHole_2014315_Data_918775_16_20240702_111309_Fractions.tiff```
2024-07-08T13:13:09.939Z
<Venky Shankar> dup inodes - two (non-hardlink) inodes which have the same inode number 🤷
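Log lines like the "loaded dup inode" error above can be triaged mechanically by pulling out the duplicated inode numbers. A hypothetical helper, assuming only the log format shown in the snippet:

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Extract the inode number from an MDS "loaded dup inode" error line,
// e.g. "... loaded dup inode 0x1005e1773e4 [565,head] v3196 at ...".
// Returns nullopt for lines that are not dup-inode errors. The pattern
// is inferred from the single log sample above; real MDS log parsing
// may need to handle more variants.
std::optional<std::string> dup_inode(const std::string& line) {
  static const std::regex re("loaded dup inode (0x[0-9a-f]+)");
  std::smatch m;
  if (std::regex_search(line, m, re))
    return m[1].str();
  return std::nullopt;
}
```

Feeding each log line through this and counting distinct inode numbers gives a quick sense of how widespread the duplication is.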
2024-07-08T13:16:33.034Z
<Dhairya Parmar> did something go wrong while flushing
2024-07-08T13:18:07.538Z
<system> file ceph-mds.pebbles-s4.log.gz too big to download (3582651 > allowed size: 1000000)
2024-07-08T13:18:07.538Z
<Dhairya Parmar> full log post second journal reset attempt
2024-07-08T13:21:01.579Z
<Venky Shankar> Did someone recommend to reset the journal?
2024-07-08T13:21:24.171Z
<Dhairya Parmar> nope. they must've thought it was the same issue again and went for it
2024-07-08T13:22:43.480Z
<Dhairya Parmar> but the backtraces are the same; it has the same assertion failure
2024-07-08T13:32:54.772Z
<Venky Shankar> we don't have enough information about what could have gone wrong
2024-07-08T13:55:05.334Z
<Dhairya Parmar> what should I ask them for better insights?
2024-07-08T13:55:26.112Z
<Dhairya Parmar> and what would be the solution to bail the fs out of this situation?
