ceph - cephfs - 2025-01-06

Timestamp (UTC) | Message
2025-01-06T20:30:24.788Z
<gregsfortytwo> The big MDS lock is never held while network activity is happening, so that invocation just queues things up and works in memory, then returns
2025-01-06T20:31:22.109Z
<gregsfortytwo> This is part of our usual event loop pattern where we dispatch messages or Contexts and they return to the caller once they have to wait on input from elsewhere
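For illustration, a minimal sketch of the pattern being described, in plain C++ with made-up names (FakeMDS, waiting_for_io), not actual Ceph source: the big lock is held only while a handler works in memory, and anything that must wait on outside input is parked and resumed by a later dispatch cycle.
```
#include <functional>
#include <mutex>
#include <queue>

// Simplified stand-in for the MDS dispatch loop: the big lock guards only
// in-memory state, and a handler returns once it would have to wait on I/O.
struct FakeMDS {
  std::mutex mds_lock;
  std::queue<std::function<void()>> waiting_for_io;  // Contexts resumed by later events

  // Every incoming event (message, command, finisher Context) is dispatched
  // with mds_lock held; the handler queues work and returns, it never blocks.
  void dispatch(const std::function<void()>& handler) {
    std::lock_guard<std::mutex> l(mds_lock);
    handler();
  }  // lock dropped here, before any network/disk wait happens

  // A later event (e.g. an OSD reply) re-enters through dispatch and resumes
  // a parked Context, again under the lock.
  void resume_one() {
    dispatch([this] {
      if (!waiting_for_io.empty()) {
        auto c = std::move(waiting_for_io.front());
        waiting_for_io.pop();
        c();
      }
    });
  }
};

int main() {
  FakeMDS mds;
  mds.dispatch([&] {
    // "queue things up and work in memory", then return
    mds.waiting_for_io.push([] { /* resumed once the awaited input arrives */ });
  });
  mds.resume_one();  // a later dispatch cycle picks the parked work back up
}
```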
2025-01-06T20:38:02.587Z
<Md Mahamudur Rahaman Sajib> I got your point, but then why does the ScrubStack still have `ceph_assert(ceph_mutex_is_locked(mdcache->mds->mds_lock))`
in the `void ScrubStack::kick_off_scrubs()` function? Doesn't that mean 2 scrubs cannot happen concurrently? And `void ScrubStack::scrub_abort(Context *on_finish)` also has the same
`ceph_assert(ceph_mutex_is_locked_by_me(mdcache->mds->mds_lock));` If kick_off_scrubs is holding that lock, then how will abort happen? Abort can only acquire the lock after the scrub finishes, isn't it?

Okay, let me be specific about the question: let's say I started a scrub from the CLI and it acquired the lock and queued the scrub job. When is that lock released, after queuing or after finishing the scrub?
2025-01-06T20:38:55.385Z
<gregsfortytwo> That lock is released after queuing.
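A hedged sketch of what "released after queuing" means, with illustrative names (handle_scrub_command, scrub_targets) rather than the real Ceph symbols: the command handler holds the lock only long enough to push the target, and the scrub work itself happens in later dispatch cycles.
```
#include <mutex>
#include <string>
#include <vector>

std::mutex mds_lock;                        // stand-in for the big MDS lock
std::vector<std::string> scrub_targets;     // stand-in for the queued ScrubStack entries

// Called when the CLI scrub command is dispatched: queue the target and return.
void handle_scrub_command(const std::string& path) {
  std::lock_guard<std::mutex> l(mds_lock);  // lock taken for the command dispatch
  scrub_targets.push_back(path);            // only queue the scrub in memory
}                                           // lock released here; the scrub itself has not run yet

int main() {
  handle_scrub_command("/root1");           // returns immediately after queuing
}
```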
2025-01-06T20:41:11.844Z
<gregsfortytwo> I suggest you look through a couple of simpler operations like getattr and setattr and make sure you grok the event loop and locking rules if you’re digging into this
2025-01-06T20:56:54.011Z
<Md Mahamudur Rahaman Sajib> Sure, I will look into it, but my next question is where this confusion occurs:
```
void ScrubStack::kick_off_scrubs()
{
  ceph_assert(ceph_mutex_is_locked(mdcache->mds->mds_lock));
```
Why this ceph_assert, when `kick_off_scrubs` runs after queuing is done (and this function is a callback after each inode scrub in the directory tree)? Shouldn't this ceph_assert be failing (which is not the case)?

Also I did some testing: I put some delay, for example a 100s sleep, in the scrub code, and from 2 panels I started 2 scrub jobs (`./bin/ceph daemon mds.a scrub start /root1 recursive repair`, this way). The second scrub job starts exactly after the first scrub job finishes (which is 100s). The same thing happened for scrub abort.
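One possible reading of that experiment, sketched below with made-up names rather than the actual ScrubStack code: kick_off_scrubs() is only ever called from a dispatch path that already holds mds_lock, so the assert passes, and a sleep placed inside the scrub path runs with the lock held, which would serialize a second scrub (or an abort) behind the first.
```
#include <chrono>
#include <mutex>
#include <thread>

std::mutex mds_lock;  // stand-in for mdcache->mds->mds_lock

// Illustrative callback: it never takes the lock itself, it relies on the
// dispatch path that called it already holding mds_lock -- which is what
// the ceph_assert in the real kick_off_scrubs() verifies.
void kick_off_scrubs_sketch() {
  // real code: ceph_assert(ceph_mutex_is_locked(mdcache->mds->mds_lock));
  // an artificial 100s sleep placed here runs with mds_lock held, so a
  // second scrub command (or an abort) from another panel cannot even be
  // dispatched until this call returns and the lock is dropped
  std::this_thread::sleep_for(std::chrono::seconds(1));
}

// Dispatch path for a scrub command: take the big lock, then call into the stack.
void dispatch_scrub_command() {
  std::lock_guard<std::mutex> l(mds_lock);
  kick_off_scrubs_sketch();
}

int main() {
  dispatch_scrub_command();
}
```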
2025-01-06T22:19:59.296Z
<Md Mahamudur Rahaman Sajib> I understood now, I was completely wrong about that event loop. Thanks for the explanation.
