ceph - cephfs - 2024-08-20

2024-08-20T06:57:03.802Z
<Igor Golikov> Hi team, excited to join cephfs! looking forward to working with you all!
2024-08-20T06:58:03.879Z
<Venky Shankar> Welcome @Igor Golikov
2024-08-20T06:59:02.460Z
<Xiubo Li> Welcome!
2024-08-20T07:19:11.210Z
<jcollin> Hi @Igor Golikov, Welcome!
2024-08-20T07:45:17.893Z
<Dhairya Parmar> Hey @Igor Golikov, welcome to the team 🙂
2024-08-20T09:40:04.017Z
<Rishabh Dave> Welcome! :)
2024-08-20T10:12:23.928Z
<Venky Shankar> @Xiubo Li around?
2024-08-20T10:12:44.495Z
<Venky Shankar> Could you check <https://pulpito.ceph.com/vshankar-2024-08-14_07:23:44-fs-wip-vshankar-testing-20240814.051955-debug-testing-default-smithi/7854722/> when you are available a bit?
2024-08-20T10:14:03.479Z
<Venky Shankar> This is the stock kernel in centos9 stream and this failure resembles <https://tracker.ceph.com/issues/48640> for which the kernel fix <https://patchwork.kernel.org/project/ceph-devel/patch/20210106014726.77614-1-xiubli@redhat.com/> should have been ported I think.
2024-08-20T10:14:19.225Z
<Venky Shankar> So, either the fix isn't in the stock kernel or this is a new failure.
2024-08-20T10:31:56.081Z
<Xiubo Li> Hi Venky, checking
2024-08-20T10:44:43.576Z
<Xiubo Li> ```2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:diff --git a/debian/ceph-mds.postinst b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:index dfe02d2308e..e69de29bb2d 100644
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:--- a/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:+++ b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:@@ -1,42 +0,0 @@
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#!/bin/sh
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# postinst script for ceph-mds
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# see: dh_installdeb(1)
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-set -e
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# summary of how this script can be called:
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        postinst configure <most-recently-configured-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        old-postinst abort-upgrade <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        conflictor's-postinst abort-remove in-favour <package> <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        postinst abort-remove
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        deconfigured's-postinst abort-deconfigure in-favour <failed-install-package> <version> [<removing conflicting-package> <version>]
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# for details, see <http://www.debian.org/doc/debian-policy/> or
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# the debian-policy package
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-case "$1" in
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    configure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- start ceph-mds-all || :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    abort-upgrade|abort-remove|abort-deconfigure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    *)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-        echo "postinst called with unknown argument \`$1'" >&2
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-        exit 1
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-esac
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# dh_installdeb will replace this with shell code automatically
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# generated by other debhelper scripts.
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-#DEBHELPER#
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-exit 0
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.932 DEBUG:teuthology.orchestra.run:got remote process result: 1```
2024-08-20T10:46:13.353Z
<Xiubo Li> The stock kernel has already included the kernel patch, so this should be a new failure.
2024-08-20T10:46:40.997Z
<Xiubo Li> Please create a new tracker and assign it to me. I will have a look this week or later.
2024-08-20T11:01:34.231Z
<Igor Golikov> Hey team, for some reason I have been assigned a MacBook Pro, and I realize that there is no way to build Ceph natively on a Mac. Does anyone here use an MBP? with any virtualization SW to run linux on it?
2024-08-20T11:03:36.823Z
<Rishabh Dave> i don't think anyone uses an MBP now, but (IIRC) a previous team member had found/built a way to do so. i'm unaware of the details though.
2024-08-20T11:07:58.517Z
<Dhairya Parmar> Building ceph has been stuck at `Performing download step (download, verify and extract) for 'Boost'` for like half an hour now. Anyone got any idea?
2024-08-20T11:09:00.446Z
<Xiubo Li> It seems stuck downloading the dependent repos.
2024-08-20T11:10:03.305Z
<Dhairya Parmar> @Rishabh Dave once told me that if i have the required boost tar then it should work, which i think i do
```dparmar:src$ ls
Boost  boost_1_82_0.tar.bz2  boost_1_85_0.tar.bz2  Boost-build  Boost-stamp  ex-Boost1234```
😕
2024-08-20T11:10:59.820Z
<Rishabh Dave> can you copy the output of `ls -lh` for the same dir?
2024-08-20T11:11:06.042Z
<Dhairya Parmar> time to run fedora in MBP XD
2024-08-20T11:11:14.785Z
<Rishabh Dave> it'll tell us if the copied tar is of the correct size.
2024-08-20T11:11:36.746Z
<Rishabh Dave> i'll compare it with tar.bz2 files in my work repo.
2024-08-20T11:11:38.350Z
<Xiubo Li> You can try removing them and trying again. I have hit similar issues several times before. I resolved this by downloading them manually. Sometimes it will work after deleting the incomplete repo and retrying.
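
A rough illustration of the sanity check being discussed here: verify the downloaded boost tarball against a known SHA256 before reusing it, and delete it if it doesn't match so the build re-fetches it. The digest and the default path below are placeholders, not values from this thread:
```
# verify_boost_tar.py - sanity-check a downloaded boost tarball before reuse.
# The expected digest is a PLACEHOLDER; take the real one from the boost
# release page (or wherever the ceph build pins its Boost download).
import hashlib
import sys
from pathlib import Path

EXPECTED_SHA256 = "<put the published sha256 here>"  # placeholder

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # assumed location; pass the real path as argv[1]
    tar = Path(sys.argv[1] if len(sys.argv) > 1 else "build/src/boost_1_85_0.tar.bz2")
    if not tar.exists():
        sys.exit(f"{tar}: missing - let the build re-download it")
    if sha256_of(tar) != EXPECTED_SHA256:
        print(f"{tar}: bad digest; deleting so the build re-fetches it")
        tar.unlink()
    else:
        print(f"{tar}: OK ({tar.stat().st_size} bytes)")
```
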
2024-08-20T11:11:57.670Z
<Dhairya Parmar> okay
2024-08-20T11:12:05.508Z
<Dhairya Parmar> so it does seem like a pretty slow download
2024-08-20T11:12:11.176Z
<Dhairya Parmar> ```-rw-r--r--. 1 dparmar dparmar 7.1M Aug 20 16:41 boost_1_85_0.tar.bz2```
2024-08-20T11:12:28.786Z
<Rishabh Dave> file size is too small
2024-08-20T11:12:42.251Z
<Rishabh Dave> i guess you didn't cancel ninja command before copying...
2024-08-20T11:13:00.832Z
<Rishabh Dave> and therefore ninja ended up over-writing the tar file
2024-08-20T11:13:09.098Z
<Xiubo Li> The repo server may also be slow for some reason; just retry after removing the incomplete repo.
2024-08-20T11:13:30.757Z
<Dhairya Parmar> okay finally......
2024-08-20T11:13:49.503Z
<Dhairya Parmar> it has moved ...
2024-08-20T11:14:02.927Z
<Rishabh Dave> nice.
2024-08-20T11:14:32.373Z
<Dhairya Parmar> lets host a server for this, i'll contribute some bucks XD
2024-08-20T11:15:18.004Z
<Rishabh Dave> i usually cancel the `ninja` command, copy the boost tar, and then run the `ninja` command again.
2024-08-20T11:15:40.508Z
<Dhairya Parmar> wow
2024-08-20T11:16:09.043Z
<Xiubo Li> ```2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "in_progress": true,
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "services_complete": [
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "crash",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "osd",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "mgr",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "mon" 
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    ],
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "progress": "12/23 daemons upgraded",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "message": "Currently upgrading osd daemons",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "is_paused": false
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:}```
Does anyone know whether this means the osd daemons' upgrade is stuck? This status lasted for around 3 hours with no progress.

This is possibly blocking the IOs.
2024-08-20T11:16:30.013Z
<Dhairya Parmar> Anything in OSD logs?
2024-08-20T11:16:58.471Z
<Xiubo Li> No useful logs found
2024-08-20T11:17:34.079Z
<Dhairya Parmar> I've seen this before, but for mgr not osd. trying to recall it.
2024-08-20T11:18:12.299Z
<Xiubo Li> This is one qa failure
2024-08-20T11:18:28.287Z
<Xiubo Li> After 3 hours it seems the blocked IO finally got replies
2024-08-20T11:18:30.132Z
<Igor Golikov> what is XD? I am not familiar with it 🙂
2024-08-20T11:18:38.758Z
<Dhairya Parmar> this is upgrade suite right
2024-08-20T11:18:44.871Z
<Xiubo Li> correct
2024-08-20T11:18:50.466Z
<Igor Golikov> XD
2024-08-20T11:18:52.866Z
<Rishabh Dave> XD = 😄
2024-08-20T11:18:53.371Z
<Igor Golikov> XD
2024-08-20T11:19:00.388Z
<Igor Golikov> kidding 🙂 i know
2024-08-20T11:19:09.297Z
<Rishabh Dave> ok. XD
2024-08-20T11:19:33.701Z
<Igor Golikov> well the bottom line is - no way to run it without 3rd party SW ... Parallels or whatever
2024-08-20T11:19:40.662Z
<Igor Golikov> thanks, will look into it further
2024-08-20T11:19:45.907Z
<Xiubo Li> This will block the `Fwb` caps from being released on the client side.
2024-08-20T11:21:01.581Z
<Rishabh Dave> i think most of us didn't try because it was not worth the effort. non-MBP works just fine.
2024-08-20T11:21:48.804Z
<Dhairya Parmar> is this seen regularly?
2024-08-20T11:25:57.326Z
<Xiubo Li> No, this is the first time I've seen it
2024-08-20T11:26:22.377Z
<Xiubo Li> This tracker <https://tracker.ceph.com/issues/67518>
2024-08-20T11:31:07.840Z
<Igor Golikov> thats correct, i just dont know why they gave me an MBP 🙂
2024-08-20T11:34:19.859Z
<Rishabh Dave> the IT folks gave me a choice between a thinkpad and an MBP, IIRC. they don't usually know what other team members are using.
2024-08-20T11:38:04.503Z
<Igor Golikov> they called me to ask, but I had no idea why not to choose an MBP (at VMware we got only MBPs with Fusion, so you can run any type of VM on them)
2024-08-20T11:39:43.289Z
<Venky Shankar> @Igor Golikov Once you have ceph lab access, there is a dedicated node (vossi01) for cephfs devs to build/run ceph clusters (using vstart.sh).
2024-08-20T11:40:38.120Z
<Dhairya Parmar> do you have cephadm logs?
2024-08-20T11:41:00.667Z
<Venky Shankar> Any recent patches in the testing kernel?
2024-08-20T11:41:49.995Z
<Dhairya Parmar> right before the upgrade logs i also see this
```2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout:    "mon": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "mgr": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "osd": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 6
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "mds": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "overall": {
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4,
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 10
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:    }
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:}```
2024-08-20T11:42:40.533Z
<Dhairya Parmar> the osds are at `19.1.0-1260-g26c3fb8e`, same with the other daemons too, apart from the mds being at `18.2.2-1767-ga3bbd728`
2024-08-20T11:43:52.903Z
<Dhairya Parmar> @Xiubo Li i think i found something interesting `2024-08-02T17:20:31.824 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:31 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS`
2024-08-20T11:44:25.105Z
<Dhairya Parmar> and this is the code
```            if not self.upgrade_state.fail_fs:
                if not (mdsmap['in'] == [0] and len(mdsmap['up']) <= 1):
                    self.mgr.log.info(
                        'Upgrade: Waiting for fs %s to scale down to reach 1 MDS' % (
                            fs_name))
                    time.sleep(10)
                    continue_upgrade = False
                    continue```
2024-08-20T11:44:37.649Z
<Dhairya Parmar> it looks like it was stuck here continuously
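
For reference, the condition cephadm is polling there can also be watched from outside the upgrade. A minimal sketch, assuming the `ceph fs dump --format json` layout (a "filesystems" list whose entries carry an "mdsmap" with "fs_name", "in" and "up"); worth verifying against a live cluster before relying on it:
```
# watch_mds_scaledown.py - rough sketch: poll whether the fs has scaled
# down to a single active MDS, mirroring the cephadm check quoted above.
# The JSON field names are assumptions from memory of `ceph fs dump`.
import json
import subprocess
import time

def mds_scaled_down(fs_name: str) -> bool:
    out = subprocess.check_output(["ceph", "fs", "dump", "--format", "json"])
    for fs in json.loads(out).get("filesystems", []):
        mdsmap = fs["mdsmap"]
        if mdsmap["fs_name"] == fs_name:
            # same condition as the cephadm snippet: only rank 0 in, <= 1 up
            return mdsmap["in"] == [0] and len(mdsmap["up"]) <= 1
    raise ValueError(f"fs {fs_name!r} not found")

if __name__ == "__main__":
    while not mds_scaled_down("cephfs"):
        print("still waiting for fs to scale down to 1 MDS")
        time.sleep(10)
    print("scaled down to 1 MDS")
```
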
2024-08-20T11:45:46.111Z
<jcollin> Beware of python version changes in a shared environment. I'd like to use a container in such cases.
2024-08-20T11:45:52.883Z
<Igor Golikov> got it. but dont u have any sandbox version of the cluster? to run locally for tests/debug?
2024-08-20T11:46:59.785Z
<jcollin> `vstart.sh` / `mstart.sh`
2024-08-20T11:48:27.015Z
<jcollin> <https://docs.ceph.com/en/quincy/dev/dev_cluster_deployement/>
2024-08-20T11:51:11.901Z
<Xiubo Li> The cephadm logs should be in `remote/smithi081/log/cephadm.log.gz`
2024-08-20T11:51:21.651Z
<Rishabh Dave> does `vstart_runner.py` run fine on vossi machines? last time i tried, it didn't work for me.
2024-08-20T11:51:57.710Z
<Xiubo Li> It seems the slow request just blocked scaling down the fs?
2024-08-20T11:52:33.250Z
<Dhairya Parmar> > It seems the slow request just blocked scaling down the fs?
im just wondering if that is the cause or the symptom
2024-08-20T11:52:53.359Z
<Xiubo Li> Do you mean the snap-related one?
2024-08-20T11:53:20.482Z
<Dhairya Parmar> If we can compare the timestamp when the first slow request occurred with the time when we see `Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS`, it would help a bit
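
The comparison can be scripted. A small sketch that finds the first occurrence of each pattern in a teuthology log and prints the gap between them, assuming the leading `YYYY-MM-DDTHH:MM:SS.mmm` timestamp format seen in the excerpts above:
```
# correlate_log_events.py - sketch: locate the first occurrence of two
# patterns in a teuthology log and report the time gap between them.
import re
import sys
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)")

def first_match(path, needle):
    with open(path, errors="replace") as f:
        for line in f:
            if needle in line:
                m = TS_RE.match(line)
                if m:
                    return datetime.fromisoformat(m.group(1))
    return None

if __name__ == "__main__":
    log = sys.argv[1]  # e.g. the downloaded teuthology.log
    slow = first_match(log, "MDS_SLOW_REQUEST")
    scale = first_match(log, "Waiting for fs cephfs to scale down to reach 1 MDS")
    print("first slow request:", slow)
    print("first scale-down wait:", scale)
    if slow and scale:
        print("gap:", scale - slow)
```
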
2024-08-20T11:53:39.654Z
<Xiubo Li> yeah, sounds reasonable
2024-08-20T11:53:44.221Z
<Xiubo Li> let me have a look
2024-08-20T11:55:09.353Z
<Venky Shankar> yeh
2024-08-20T11:56:06.640Z
<Dhairya Parmar> and i found another reason
```2024-08-02T17:20:19.659 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:19 smithi081 ceph-mon[92715]: Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)```
2024-08-20T11:57:26.741Z
<Xiubo Li> The revoke happened at  `2024-08-02T17:14:23`:

```2024-08-02T17:14:23.288+0000 7f2502ad1640 10 mds.0.15 send_message_client_counted client.24283 seq 24071 client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12
2024-08-02T17:14:23.288+0000 7f2502ad1640  1 -- [v2:172.21.15.81:6828/1710115680,v1:172.21.15.81:6829/1710115680] --> v1:172.21.15.81:0/3322994407 -- client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12 -- 0x560e59b78380 con 0x560e530e4000```
And then it got stuck.

Then the `fs` scale down happened 6 minutes later:

```2024-08-02T17:20:21.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: stopping daemon mds.cephfs.smithi110.ttqwpb
...
2024-08-02T17:20:21.660 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS```
2024-08-20T11:57:42.078Z
<Xiubo Li> No
2024-08-20T11:57:56.882Z
<Dhairya Parmar> so this means the request stalled the MDS?
2024-08-20T11:58:12.642Z
<Xiubo Li> I think so, but I need to confirm
2024-08-20T11:58:20.920Z
<Venky Shankar> The last run essentially had a bug in mon_thrash task that caused most fs:thrash to not run in its entirety.
2024-08-20T11:58:50.865Z
<Venky Shankar> And this branch has that fix in place, so we probably missed this earlier..
2024-08-20T11:59:00.200Z
<Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:08.158 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:07 smithi081 ceph-mon[92715]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)```
2024-08-20T11:59:02.107Z
<Venky Shankar> anyway, I'll check and create tracker.
2024-08-20T11:59:14.310Z
<Venky Shankar> works just fine for me
2024-08-20T11:59:20.629Z
<Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: 1 osds down (OSD_DOWN)
2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)```
2024-08-20T11:59:48.702Z
<Dhairya Parmar> IMO the `require_osd_release` needs to be updated
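
For context, that flag can be inspected with `ceph osd dump` and, outside of an orchestrated upgrade, bumped with `ceph osd require-osd-release <release>`. A hedged sketch for inspection only; during a cephadm upgrade the orchestrator is normally expected to raise the flag itself once all OSDs are done:
```
# check_require_osd_release.py - sketch: read require_osd_release from
# `ceph osd dump` output. For inspection only; during a cephadm upgrade
# the orchestrator normally raises this flag itself at the end.
import json
import subprocess

def require_osd_release() -> str:
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)["require_osd_release"]

if __name__ == "__main__":
    current = require_osd_release()
    print(f"require_osd_release = {current}")
    if current != "squid":
        # the manual bump, if it really is warranted:
        print("to raise it: ceph osd require-osd-release squid")
```
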
2024-08-20T12:02:59.469Z
<Xiubo Li> Okay, let me check
2024-08-20T12:03:01.431Z
<Xiubo Li> ```2024-08-02T17:32:32.519+0000 7f2879e1d640  5 mds.1.7 shutdown_pass=false
2024-08-02T17:32:32.519+0000 7f2879e1d640 20 mds.beacon.cephfs.smithi110.ttqwpb 1 slow request found
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 get_task_status
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 schedule_update_timer_task
2024-08-02T17:32:32.619+0000 7f287961c640  5 mds.beacon.cephfs.smithi110.ttqwpb Sending beacon up:stopping seq 308```
2024-08-20T12:03:20.066Z
<Xiubo Li> It was the slow request that blocked the scaling down
2024-08-20T12:03:45.616Z
<Dhairya Parmar> so the `require_osd_release`  is not a blocker right
2024-08-20T12:03:51.360Z
<Dhairya Parmar> which took the osd down
2024-08-20T12:05:18.483Z
<Xiubo Li> The cap revoke got stuck and then the slow request happened. The cap revoke was caused by the `Fwb` caps, which are waiting for the data to be flushed to RADOS.
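
For readers unfamiliar with the cap-string notation used in this thread (`pAsxLsXsxFscr`, `Fwb`), a small decoder following the usual CephFS convention; the letter meanings here are an informal reading aid, not taken from this thread:
```
# cap_string.py - rough decoder for MDS cap strings like "pAsxLsXsxFscr".
# Letter meanings follow the usual CephFS convention: p=pin, A=auth,
# L=link, X=xattr, F=file; s=shared, x=excl, r=read, w=write,
# c=cache, b=buffer, l=lazyio. So "Fwb" reads as buffered file writes.
CATEGORIES = {"A": "auth", "L": "link", "X": "xattr", "F": "file"}
SUBCAPS = {"s": "shared", "x": "excl", "r": "read", "w": "write",
           "c": "cache", "b": "buffer", "l": "lazyio"}

def decode(caps):
    out, current = [], None
    for ch in caps:
        if ch == "p":
            out.append("pin")
        elif ch in CATEGORIES:
            current = CATEGORIES[ch]
        elif current and ch in SUBCAPS:
            out.append(f"{current}-{SUBCAPS[ch]}")
    return out

if __name__ == "__main__":
    print(decode("Fwb"))            # ['file-write', 'file-buffer']
    print(decode("pAsxLsXsxFscr"))  # pin, auth-*, link-*, xattr-*, file-*
```
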
2024-08-20T12:05:37.265Z
<Xiubo Li> So I need to confirm whether the osd down issue caused the IO blocking
2024-08-20T12:07:54.522Z
<Xiubo Li> sure
2024-08-20T12:12:27.371Z
<Dhairya Parmar> @Xiubo Li the slow request warning is only seen after OSD upgrade starts:
```2024-08-02T17:17:19.543 INFO:teuthology.orchestra.run.smithi081.stderr:2024-08-02T17:17:19.542+0000 7f60deffd640  1 -- 172.21.15.81:0/2869246966 <== mgr.34104 v2:172.21.15.81:6800/3996574518 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+400 (secure 0 0 0) 0x7f60e802c450 con 0x7f60d406e2b0
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "in_progress": true,
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "services_complete": [
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "crash",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "mgr",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "mon"
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    ],
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "progress": "6/23 daemons upgraded",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "message": "Currently upgrading osd daemons",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "is_paused": false
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:}```
2024-08-20T12:14:14.172Z
<Rishabh Dave> okay, i had tried running it a couple of weeks ago and it wasn't working for me. IIRC vstart.sh failed every time i tried. i'll try again...
2024-08-20T12:14:21.339Z
<Xiubo Li> Yeah, so I just suspect the osd upgrade blocked the IOs, i.e. the dirty data writeback
2024-08-20T12:14:21.414Z
<Dhairya Parmar> So as soon as the upgrade started, the client reqs couldn't be flushed
2024-08-20T12:15:11.083Z
<Xiubo Li> it should be that the dirty data flushing from the buffer is blocked
2024-08-20T12:15:11.900Z
<Dhairya Parmar> before the first slow req warning i see this
```2024-08-02T17:17:27.946 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:27 smithi081 ceph-mon[92715]: pgmap v43: 65 pgs: 34 active+undersized+degraded, 31 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 0 B/s rd, 3 op/s; 5168/34245 objects degraded (15.091%)```
2024-08-20T12:15:46.415Z
<Dhairya Parmar> > it should be the dirty data flushing from buffer is blocked
exactly! the objects are degraded
2024-08-20T12:16:13.419Z
<Dhairya Parmar> see this
```2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check failed: Degraded data redundancy: 5168/34245 objects degraded (15.091%), 34 pgs degraded (PG_DEGRADED)
2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering)
2024-08-02T17:17:28.967 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:28 smithi081 ceph-mon[92715]: Health check failed: 2 MDSs report slow requests (MDS_SLOW_REQUEST)```
2024-08-20T12:17:30.464Z
<Dhairya Parmar> and the degradation is because an osd had died
```2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Got signal Terminated ***
2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: osd.0 marked itself down and dead
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: osd.0 marked itself down and dead
2024-08-02T17:17:22.382 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:22 smithi081 podman[101253]: 2024-08-02 17:17:22.069729155 +0000 UTC m=+1.005376791 container died 6d71c06ca77f31c078d37dcfc7db45a2f9b4ffd8ea6eacddd8e36932d96cb6ac (image=quay.ceph.io/ceph-ci/ceph@sha256:874ad160b08ea56a94a9c10c9da918eee8eec002405aef1c3b4a5423f6209448, name=ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0, io.buildah.version=1.36.0, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, GIT_BRANCH=HEAD, ceph=True, RELEASE=reef-a3bbd72, org.label-schema.build-date=20240716, org.label-schema.schema-version=1.0, GIT_REPO=git@github.com:ceph/ceph-container.git, GIT_CLEAN=True, CEPH_POINT_RELEASE=, GIT_COMMIT=c5aaba5e3282b30e4782f2b5d6e4e362e22dfcb7, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.name=CentOS Stream 9 Base Image)
2024-08-02T17:17:22.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:22 smithi081 ceph-mon[92715]: Health check failed: 1 osds down (OSD_DOWN)```
2024-08-20T12:17:39.577Z
<Xiubo Li> Hmm, yeah, this should block the IOs
2024-08-20T12:18:24.505Z
<Dhairya Parmar> osd.0 marked itself down and dead
2024-08-20T12:18:38.913Z
<Xiubo Li> it recovered later
2024-08-20T12:18:47.606Z
<Dhairya Parmar> yea
2024-08-20T12:18:56.429Z
<Xiubo Li> But the osd upgrade status stayed in progress and lasted 3 hours
2024-08-20T12:18:57.675Z
<Dhairya Parmar> ```2024-08-02T17:17:50.728 INFO:teuthology.orchestra.run.smithi081.stdout:osd.0                        smithi081                    running (24s)     19s ago   9m    12.1M```
2024-08-20T12:19:23.430Z
<Xiubo Li> though the osd daemons recovered, the IOs didn't
2024-08-20T12:19:32.750Z
<Dhairya Parmar> IOs didn't get flushed
2024-08-20T12:19:43.837Z
<Dhairya Parmar> is this the same as some waiter not waking up?
2024-08-20T12:19:55.858Z
<Xiubo Li> Or they were already flushed but the osd didn't reply and got stuck
2024-08-20T12:20:33.744Z
<Xiubo Li> It shouldn't be the same issue, this one is a kclient
2024-08-20T12:21:22.320Z
<Dhairya Parmar> okay,
2024-08-20T12:31:26.133Z
<Igor Golikov> do we have the daily meeting right now?
2024-08-20T12:33:52.195Z
<jcollin> @Igor Golikov <https://meet.jit.si/cephfs-standup>
2024-08-20T19:52:37.910Z
<reid.guyett> Are there any built-in ops/bw limiters in cephfs?
