ceph - cephfs - 2024-07-08

Timestamp (UTC) | Message
2024-07-08T02:46:47.735Z
<Xiubo Li> Found the root cause: it's a deadlock between the `lock cache` and `create/unlink`; please see <https://tracker.ceph.com/issues/65607#note-7>.

Need to find a proper way to fix it
2024-07-08T02:48:56.122Z
<Xiubo Li> @Patrick Donnelly @Venky Shankar I want to add an mds option to allow us to disable the `lock cache` feature along with the bug fix. Could you foresee any potential issues with this?
2024-07-08T04:39:15.717Z
<Jos Collin> @Venky Shankar Is the PR ready?
2024-07-08T04:40:20.140Z
<Venky Shankar> no, sorry!
2024-07-08T04:40:27.918Z
<Venky Shankar> I worked on it on Friday
2024-07-08T04:40:36.617Z
<Venky Shankar> but couldn't finish it
2024-07-08T04:40:50.079Z
<Jos Collin> ok
2024-07-08T04:40:58.314Z
<Venky Shankar> Give me a few hours
2024-07-08T04:41:01.765Z
<Venky Shankar> will push a fix
2024-07-08T04:41:57.623Z
<Jos Collin> ok
2024-07-08T05:57:21.460Z
<Jos Collin> @Venky Shankar <https://tracker.ceph.com/issues/64761>, what is the maximum number of sync operations? From reading some of the qa tests, it should be 3?
2024-07-08T06:01:13.877Z
<Jos Collin> That should be a small value.
2024-07-08T06:16:50.490Z
<Venky Shankar> checking
2024-07-08T06:17:20.243Z
<Venky Shankar> there are 4 sync threads I believe
2024-07-08T06:21:20.532Z
<Jos Collin> I was referring to <https://github.com/ceph/ceph/blob/main/qa/tasks/cephfs/test_mirroring.py#L1332>.
2024-07-08T06:21:30.248Z
<Jos Collin> Looks like 3 by default?
2024-07-08T06:21:30.595Z
<Venky Shankar> ah
2024-07-08T06:21:35.795Z
<Venky Shankar> then yes 🙂
2024-07-08T06:21:41.677Z
<Venky Shankar> You can check PeerReplayer.cc
2024-07-08T06:27:02.729Z
<Jos Collin> ok, will do. I think a config option for the number of ops, plus a check that rejects new syncs when the throttle is reached, somewhere in sync_snaps(), would suffice.
2024-07-08T06:27:29.445Z
<Venky Shankar> well, throttler would actually do that
2024-07-08T06:27:40.602Z
<Venky Shankar> so, basically PeerReplayer can inherit throttler
2024-07-08T06:27:55.514Z
<Venky Shankar> no need to code the throttle again
2024-07-08T06:28:00.383Z
<Jos Collin> aah, that's cool
2024-07-08T06:28:01.274Z
<Venky Shankar> check usages of Throttle class
2024-07-08T06:28:04.082Z
<Venky Shankar> yeh +1
2024-07-08T06:28:14.011Z
<Jos Collin> Let me check that then.
2024-07-08T06:28:26.934Z
<Jos Collin> thx
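The throttling approach discussed above (reuse a throttle rather than re-implementing the check in sync_snaps()) might look roughly like the sketch below. This is a minimal illustration of the pattern only, not Ceph's actual Throttle API (see src/common/Throttle.h in the ceph tree for the real class); all names here are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a counting throttle: callers take a slot
// before starting a snapshot sync and release it when done, so at most
// `max` syncs run concurrently.
class SimpleThrottle {
public:
  explicit SimpleThrottle(int max) : max_(max) {}

  // Block until a slot is free, then take it.
  void get() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return count_ < max_; });
    ++count_;
  }

  // Non-blocking variant: reject when the throttle is saturated,
  // matching the "check/reject" idea mentioned for sync_snaps().
  bool try_get() {
    std::lock_guard<std::mutex> l(m_);
    if (count_ >= max_)
      return false;
    ++count_;
    return true;
  }

  // Release a previously taken slot and wake one waiter.
  void put() {
    std::lock_guard<std::mutex> l(m_);
    assert(count_ > 0);
    --count_;
    cv_.notify_one();
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  int max_;
  int count_ = 0;
};
```

With this shape, the number of concurrent syncs becomes a single configurable constructor argument instead of ad-hoc checks scattered through the replayer.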
2024-07-08T08:51:00.638Z
<Jos Collin> @Venky Shankar Today's standup at 4pm India time?
2024-07-08T09:21:04.170Z
<Venky Shankar> yes
2024-07-08T09:21:12.522Z
<Venky Shankar> Most folks in US are on PTO this week.
2024-07-08T12:50:57.982Z
<Dhairya Parmar> @Venky Shankar remember the upstream user who hit <https://tracker.ceph.com/issues/61009> recently, well once they got their fs up and running but unfortunately the filesystem crashed again with
```/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 275: FAILED ceph_assert(!p)

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7fc4693334a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7fc469333669]
 3: /usr/bin/ceph-mds(+0x2d2f04) [0x55caed408f04]
 4: (MDCache::cow_inode(CInode*, snapid_t)+0x334) [0x55caed40eeb4]
 5: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*, CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x8d2) [0x55caed410792]
 6: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x966) [0x55caed35b596]
 7: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x6d5) [0x55caed3ab295]
 8: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x13c) [0x55caed3aff6c]
 9: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5db) [0x55caed2f827b]
 10: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x5c) [0x55caed2f88bc]
 11: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x55caed2e1c2f]
 12: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7fc4695aeed8]
 13: (DispatchQueue::entry()+0x50f) [0x7fc4695ac31f]
 14: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fc469676381]
 15: /lib64/libpthread.so.0(+0x81ca) [0x7fc4683211ca]
 16: clone()```
2024-07-08T12:51:37.267Z
<Venky Shankar> oh dear
2024-07-08T12:51:39.239Z
<Venky Shankar> that's nasty
2024-07-08T12:51:40.096Z
<Venky Shankar> 😕
2024-07-08T12:51:48.594Z
<Venky Shankar> they are using kclient I believe?
2024-07-08T12:52:14.432Z
<Dhairya Parmar> yeah, and to add to the issues they even turned off `nowsync`  with their kclient mounts
2024-07-08T12:52:44.724Z
<Venky Shankar> nowsync does not affect the `_unlink_local` crash
2024-07-08T12:53:20.735Z
<Dhairya Parmar> okay
2024-07-08T12:53:31.876Z
<Dhairya Parmar> I'm reading the cow_inode code
2024-07-08T12:53:43.599Z
<Dhairya Parmar> and the comment is interesting
```/*
 * note: i'm currently cheating wrt dirty and inode.version on cow
 * items.  instead of doing a full dir predirty, i just take the
 * original item's version, and set the dirty flag (via
 * mutation::add_cow_{inode,dentry}() and mutation::apply().  that
 * means a special case in the dir commit clean sweep assertions.
 * bah.
 */```
2024-07-08T12:53:50.119Z
<Venky Shankar> kclient passed in that bogus value I believe
2024-07-08T12:54:12.795Z
<Dhairya Parmar> while replaying the journal, you mean?
2024-07-08T12:55:02.458Z
<Venky Shankar> nope
2024-07-08T12:55:12.065Z
<Venky Shankar> In a client request and that got recorded in the journal
2024-07-08T12:55:25.698Z
<Dhairya Parmar> oh!
2024-07-08T12:55:27.450Z
<Venky Shankar> see: <https://tracker.ceph.com/issues/54546#note-7>
2024-07-08T12:55:29.670Z
<Dhairya Parmar> oh!
2024-07-08T12:58:14Z
<Dhairya Parmar> the assertion failures are different
2024-07-08T12:59:04.204Z
<Venky Shankar> kk
2024-07-08T12:59:09.301Z
<Venky Shankar> _unlink_local gave it away
2024-07-08T12:59:15.319Z
<Venky Shankar> I believe, this could be kind of related
2024-07-08T12:59:34.055Z
<Venky Shankar> btw, did the user run scrub post recovering from the interval_set failure?
2024-07-08T13:00:10.002Z
<Dhairya Parmar> yeah, they ran it; here's what they did

```We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again.```
2024-07-08T13:00:27.288Z
<Dhairya Parmar> so post scrub they did a repaid scrub, does that ring any bells?
2024-07-08T13:01:16.076Z
<Venky Shankar> that's expected
2024-07-08T13:01:21.242Z
<Venky Shankar> some repair might be normal
2024-07-08T13:01:36.298Z
<Dhairya Parmar> okay
2024-07-08T13:01:49.767Z
<Venky Shankar> I'll have to check what could have happened
2024-07-08T13:02:31.445Z
<Dhairya Parmar> https://files.slack.com/files-pri/T1HG3J90S-F07BTTXH3NV/download/ceph-mds.pebbles-s2.log-20240708.gz
2024-07-08T13:03:49.307Z
<Dhairya Parmar> this is the mds log that contains that assertion failure. once they hit this, they reset the journal again (basically following the same steps I sent them), but now the fs is in the failed state
2024-07-08T13:04:03.057Z
<Dhairya Parmar> they've shared logs. I'm downloading them currently
2024-07-08T13:07:43.409Z
<Venky Shankar> I'm not familiar with this backtrace, so resetting the journal might not help.
2024-07-08T13:07:54.116Z
<Venky Shankar> My guess is the fs is in a pretty bad state 😕
2024-07-08T13:08:05.294Z
<Dhairya Parmar> yikes
2024-07-08T13:10:32.807Z
<Dhairya Parmar> done downloading the log
2024-07-08T13:10:39.836Z
<Dhairya Parmar> it is filled with this line
```2024-07-08T10:19:19.935+0100 7f5201d08700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x1005e1773e4 [565,head] v3196 at /cephfs2-users/vchang/20240701_VC143-2_GluA3ATD-NB65/Movies/Images-Disc1/GridSquare_896930/Data/FoilHole_2014315_Data_918775_16_20240702_111309_Fractions.tiff.vifeyek0.partial, but inode 0x1005e1773e4.head v3198 already exists at /cephfs2-users/vchang/20240701_VC143-2_GluA3ATD-NB65/Movies/Images-Disc1/GridSquare_896930/Data/FoilHole_2014315_Data_918775_16_20240702_111309_Fractions.tiff```
2024-07-08T13:13:09.939Z
<Venky Shankar> dup inodes - two (non-hardlink) inodes which have the same inode number 🤷
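Log lines like the "loaded dup inode" error above can be triaged mechanically by pulling out the duplicated inode numbers. A hypothetical helper, assuming only the log format shown in the snippet:

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Extract the inode number from an MDS "loaded dup inode" error line,
// e.g. "... loaded dup inode 0x1005e1773e4 [565,head] v3196 at ...".
// Returns nullopt for lines that are not dup-inode errors. The pattern
// is inferred from the single log sample above; real MDS log parsing
// may need to handle more variants.
std::optional<std::string> dup_inode(const std::string& line) {
  static const std::regex re("loaded dup inode (0x[0-9a-f]+)");
  std::smatch m;
  if (std::regex_search(line, m, re))
    return m[1].str();
  return std::nullopt;
}
```

Feeding each log line through this and counting distinct inode numbers gives a quick sense of how widespread the duplication is.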
2024-07-08T13:16:33.034Z
<Dhairya Parmar> did something go wrong while flushing
2024-07-08T13:18:07.538Z
<system> file ceph-mds.pebbles-s4.log.gz too big to download (3582651 > allowed size: 1000000)
2024-07-08T13:18:07.538Z
<Dhairya Parmar> full log post second journal reset attempt
2024-07-08T13:21:01.579Z
<Venky Shankar> Did someone recommend to reset the journal?
2024-07-08T13:21:24.171Z
<Dhairya Parmar> nope. they must've thought it was the same issue again and went for it
2024-07-08T13:22:43.480Z
<Dhairya Parmar> but the backtraces are the same; it has the same assertion failure
2024-07-08T13:32:54.772Z
<Venky Shankar> we don't have enough information about what could have gone wrong
2024-07-08T13:55:05.334Z
<Dhairya Parmar> what should I ask them for better insights?
2024-07-08T13:55:26.112Z
<Dhairya Parmar> and what would be the solution to bail the fs out of this situation?
