ceph - cephfs - 2024-08-20

2024-08-20T06:57:03.802Z
<Igor Golikov> Hi team, excited to join cephfs! looking forward to working with you all!
2024-08-20T06:58:03.879Z
<Venky Shankar> Welcome @Igor Golikov
2024-08-20T06:59:02.460Z
<Xiubo Li> Welcome!
2024-08-20T07:19:11.210Z
<jcollin> Hi @Igor Golikov, Welcome!
2024-08-20T07:45:17.893Z
<Dhairya Parmar> Hey @Igor Golikov, welcome to the team 🙂
2024-08-20T09:40:04.017Z
<Rishabh Dave> Welcome! :)
2024-08-20T10:12:23.928Z
<Venky Shankar> @Xiubo Li around?
2024-08-20T10:12:44.495Z
<Venky Shankar> Could you check <https://pulpito.ceph.com/vshankar-2024-08-14_07:23:44-fs-wip-vshankar-testing-20240814.051955-debug-testing-default-smithi/7854722/> when you are available a bit?
2024-08-20T10:14:03.479Z
<Venky Shankar> This is the stock kernel in centos9 stream and this failure resembles <https://tracker.ceph.com/issues/48640> for which the kernel fix <https://patchwork.kernel.org/project/ceph-devel/patch/20210106014726.77614-1-xiubli@redhat.com/> should have been ported I think.
2024-08-20T10:14:19.225Z
<Venky Shankar> So, either the fix isn't in the stock kernel or this is a new failure.
2024-08-20T10:31:56.081Z
<Xiubo Li> Hi Venky, checking
2024-08-20T10:44:43.576Z
<Xiubo Li> ```2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:diff --git a/debian/ceph-mds.postinst b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:index dfe02d2308e..e69de29bb2d 100644
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:--- a/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:+++ b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:@@ -1,42 +0,0 @@
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#!/bin/sh
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# postinst script for ceph-mds
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# see: dh_installdeb(1)
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-set -e
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# summary of how this script can be called:
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        postinst configure <most-recently-configured-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        old-postinst abort-upgrade <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        conflictor's-postinst abort-remove in-favour <package> <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        postinst abort-remove
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#        deconfigured's-postinst abort-deconfigure in-favour <failed-install-package> <version> [<removing conflicting-package> <version>]
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# for details, see <http://www.debian.org/doc/debian-policy/> or
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# the debian-policy package
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-case "$1" in
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    configure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- start ceph-mds-all || :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    abort-upgrade|abort-remove|abort-deconfigure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-    *)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-        echo "postinst called with unknown argument \`$1'" >&2
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-        exit 1
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-    ;;
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-esac
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# dh_installdeb will replace this with shell code automatically
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# generated by other debhelper scripts.
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-#DEBHELPER#
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-exit 0
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.932 DEBUG:teuthology.orchestra.run:got remote process result: 1```
2024-08-20T10:46:13.353Z
<Xiubo Li> The stock kernel has already included the kernel patch, so this should be a new failure.
2024-08-20T10:46:40.997Z
<Xiubo Li> Please create a new tracker and assign it to me. I will have a look this week or later.
2024-08-20T11:01:34.231Z
<Igor Golikov> Hey team, for some reason I have been assigned a MacBook Pro, and I realize that there is no way to build Ceph natively on a Mac. Does anyone here use an MBP? with any virtualization SW to run linux on it?
2024-08-20T11:03:36.823Z
<Rishabh Dave> i don't think anyone uses an MBP now, but (IIRC) a previous team member had found/built a way to do so. i'm unaware of the details though.
2024-08-20T11:07:58.517Z
<Dhairya Parmar> Building ceph has been stuck at `Performing download step (download, verify and extract) for 'Boost'` for like half an hour now. Anyone got any idea?
2024-08-20T11:09:00.446Z
<Xiubo Li> It seems stuck downloading the dependent repos.
2024-08-20T11:10:03.305Z
<Dhairya Parmar> @Rishabh Dave once told me that if i have the required boost tar then it should work, which i think i do
```dparmar:src$ ls
Boost  boost_1_82_0.tar.bz2  boost_1_85_0.tar.bz2  Boost-build  Boost-stamp  ex-Boost1234```
😕
2024-08-20T11:10:59.820Z
<Rishabh Dave> can you copy the output of `ls -lh` for the same dir?
2024-08-20T11:11:06.042Z
<Dhairya Parmar> time to run fedora in MBP XD
2024-08-20T11:11:14.785Z
<Rishabh Dave> it'll tell us if the copied tar is of the correct size.
2024-08-20T11:11:36.746Z
<Rishabh Dave> i'll compare it with tar.bz2 files in my work repo.
2024-08-20T11:11:38.350Z
<Xiubo Li> You can try removing them and trying again. I have hit similar issues several times before. I resolved this by downloading them manually. Sometimes it will work after deleting the incomplete repo and retrying.
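
A rough illustration of the sanity check being discussed here: verify the downloaded boost tarball against a known SHA256 before reusing it, and delete it if it doesn't match so the build re-fetches it. The digest and the default path below are placeholders, not values from this thread:
```
# verify_boost_tar.py - sanity-check a downloaded boost tarball before reuse.
# The expected digest is a PLACEHOLDER; take the real one from the boost
# release page (or wherever the ceph build pins its Boost download).
import hashlib
import sys
from pathlib import Path

EXPECTED_SHA256 = "<put the published sha256 here>"  # placeholder

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    # assumed location; pass the real path as argv[1]
    tar = Path(sys.argv[1] if len(sys.argv) > 1 else "build/src/boost_1_85_0.tar.bz2")
    if not tar.exists():
        sys.exit(f"{tar}: missing - let the build re-download it")
    if sha256_of(tar) != EXPECTED_SHA256:
        print(f"{tar}: bad digest; deleting so the build re-fetches it")
        tar.unlink()
    else:
        print(f"{tar}: OK ({tar.stat().st_size} bytes)")
```
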
2024-08-20T11:11:57.670Z
<Dhairya Parmar> okay
2024-08-20T11:12:05.508Z
<Dhairya Parmar> so it does seem like a pretty slow download
2024-08-20T11:12:11.176Z
<Dhairya Parmar> ```-rw-r--r--. 1 dparmar dparmar 7.1M Aug 20 16:41 boost_1_85_0.tar.bz2```
2024-08-20T11:12:28.786Z
<Rishabh Dave> file size is too small
2024-08-20T11:12:42.251Z
<Rishabh Dave> i guess you didn't cancel ninja command before copying...
2024-08-20T11:13:00.832Z
<Rishabh Dave> and therefore ninja ended up over-writing the tar file
2024-08-20T11:13:09.098Z
<Xiubo Li> The repo server may also be slow for some reason; just retry after removing the incomplete repo.
2024-08-20T11:13:30.757Z
<Dhairya Parmar> okay finally......
2024-08-20T11:13:49.503Z
<Dhairya Parmar> it has moved ...
2024-08-20T11:14:02.927Z
<Rishabh Dave> nice.
2024-08-20T11:14:32.373Z
<Dhairya Parmar> lets host a server for this, i'll contribute some bucks XD
2024-08-20T11:15:18.004Z
<Rishabh Dave> i usually cancel the `ninja` command, copy the boost tar, and then run the `ninja` command again.
2024-08-20T11:15:40.508Z
<Dhairya Parmar> wow
2024-08-20T11:16:09.043Z
<Xiubo Li> ```2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "in_progress": true,
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    "services_complete": [
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "crash",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "osd",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "mgr",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:        "mon" 
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout:    ],
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "progress": "12/23 daemons upgraded",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "message": "Currently upgrading osd daemons",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:    "is_paused": false
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:}```
Does anyone know whether this means the osd daemons' upgrade is stuck? This status lasted for around 3 hours with no progress.

This is possibly blocking the IOs.
2024-08-20T11:16:30.013Z
<Dhairya Parmar> Anything in OSD logs?
2024-08-20T11:16:58.471Z
<Xiubo Li> No useful logs found
2024-08-20T11:17:34.079Z
<Dhairya Parmar> I've seen this before, but for mgr not osd. trying to recall it.
2024-08-20T11:18:12.299Z
<Xiubo Li> This is one qa failure
2024-08-20T11:18:28.287Z
<Xiubo Li> After 3 hours it seems the blocked IO finally got replies
2024-08-20T11:18:30.132Z
<Igor Golikov> what is XD? I am not familiar with it 🙂
2024-08-20T11:18:38.758Z
<Dhairya Parmar> this is upgrade suite right
2024-08-20T11:18:44.871Z
<Xiubo Li> correct
2024-08-20T11:18:50.466Z
<Igor Golikov> XD
2024-08-20T11:18:52.866Z
<Rishabh Dave> XD = 😄
2024-08-20T11:18:53.371Z
<Igor Golikov> XD
2024-08-20T11:19:00.388Z
<Igor Golikov> kidding 🙂 i know
2024-08-20T11:19:09.297Z
<Rishabh Dave> ok. XD
2024-08-20T11:19:33.701Z
<Igor Golikov> well the bottom line is - no way to run it without 3rd party SW ... Parallels or whatever
2024-08-20T11:19:40.662Z
<Igor Golikov> thanks, will look into it further
2024-08-20T11:19:45.907Z
<Xiubo Li> This will block the `Fwb` caps from being released on the client side.
2024-08-20T11:21:01.581Z
<Rishabh Dave> i think most of us didn't try because it was not worth the effort. non-MBP works just fine.
2024-08-20T11:21:48.804Z
<Dhairya Parmar> is this seen regularly?
2024-08-20T11:25:57.326Z
<Xiubo Li> No, this is the first time I've seen it
2024-08-20T11:26:22.377Z
<Xiubo Li> This tracker <https://tracker.ceph.com/issues/67518>
2024-08-20T11:31:07.840Z
<Igor Golikov> thats correct, i just dont know why they gave me an MBP 🙂
2024-08-20T11:34:19.859Z
<Rishabh Dave> the IT folks gave me a choice between a thinkpad and an MBP, IIRC. they don't usually know what other team members are using.
2024-08-20T11:38:04.503Z
<Igor Golikov> they called me to ask, but I had no idea why not to choose an MBP (at VMware we got only MBPs with Fusion, so you can run any type of VM on them)
2024-08-20T11:39:43.289Z
<Venky Shankar> @Igor Golikov Once you have ceph lab access, there is a dedicated node (vossi01) for cephfs devs to build/run ceph clusters (using vstart.sh).
2024-08-20T11:40:38.120Z
<Dhairya Parmar> do you have cephadm logs?
2024-08-20T11:41:00.667Z
<Venky Shankar> Any recent patches in the testing kernel?
2024-08-20T11:41:49.995Z
<Dhairya Parmar> right before the upgrade logs i also see this
```2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout:    "mon": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "mgr": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "osd": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 6
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "mds": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout:    "overall": {
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4,
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:        "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 10
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:    }
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:}```
2024-08-20T11:42:40.533Z
<Dhairya Parmar> the osds are at `19.1.0-1260-g26c3fb8e`, same with the other daemons too, apart from the mds being at `18.2.2-1767-ga3bbd728`
2024-08-20T11:43:52.903Z
<Dhairya Parmar> @Xiubo Li i think i found something interesting `2024-08-02T17:20:31.824 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:31 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS`
2024-08-20T11:44:25.105Z
<Dhairya Parmar> and this is the code
```            if not self.upgrade_state.fail_fs:
                if not (mdsmap['in'] == [0] and len(mdsmap['up']) <= 1):
                    self.mgr.log.info(
                        'Upgrade: Waiting for fs %s to scale down to reach 1 MDS' % (
                            fs_name))
                    time.sleep(10)
                    continue_upgrade = False
                    continue```
2024-08-20T11:44:37.649Z
<Dhairya Parmar> it looks like it was stuck here continuously
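
For reference, the condition cephadm is polling there can also be watched from outside the upgrade. A minimal sketch, assuming the `ceph fs dump --format json` layout (a "filesystems" list whose entries carry an "mdsmap" with "fs_name", "in" and "up"); worth verifying against a live cluster before relying on it:
```
# watch_mds_scaledown.py - rough sketch: poll whether the fs has scaled
# down to a single active MDS, mirroring the cephadm check quoted above.
# The JSON field names are assumptions from memory of `ceph fs dump`.
import json
import subprocess
import time

def mds_scaled_down(fs_name: str) -> bool:
    out = subprocess.check_output(["ceph", "fs", "dump", "--format", "json"])
    for fs in json.loads(out).get("filesystems", []):
        mdsmap = fs["mdsmap"]
        if mdsmap["fs_name"] == fs_name:
            # same condition as the cephadm snippet: only rank 0 in, <= 1 up
            return mdsmap["in"] == [0] and len(mdsmap["up"]) <= 1
    raise ValueError(f"fs {fs_name!r} not found")

if __name__ == "__main__":
    while not mds_scaled_down("cephfs"):
        print("still waiting for fs to scale down to 1 MDS")
        time.sleep(10)
    print("scaled down to 1 MDS")
```
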
2024-08-20T11:45:46.111Z
<jcollin> Beware of python version changes in a shared environment. I'd like to use a container in such cases.
2024-08-20T11:45:52.883Z
<Igor Golikov> got it. but dont u have any sandbox version of the cluster? to run locally for tests/debug?
2024-08-20T11:46:59.785Z
<jcollin> `vstart.sh` / `mstart.sh`
2024-08-20T11:48:27.015Z
<jcollin> <https://docs.ceph.com/en/quincy/dev/dev_cluster_deployement/>
2024-08-20T11:51:11.901Z
<Xiubo Li> The cephadm logs should be in `remote/smithi081/log/cephadm.log.gz`
2024-08-20T11:51:21.651Z
<Rishabh Dave> does `vstart_runner.py` run fine on vossi machines? last time i tried, it didn't work for me.
2024-08-20T11:51:57.710Z
<Xiubo Li> It seems the slow request just blocked scaling down the fs?
2024-08-20T11:52:33.250Z
<Dhairya Parmar> > It seems the slow request just blocked scaling down the fs?
im just wondering if that is the cause or the symptom
2024-08-20T11:52:53.359Z
<Xiubo Li> Do you mean the snap-related one?
2024-08-20T11:53:20.482Z
<Dhairya Parmar> If we can compare the timestamp when the first slow request occurred with the time when we see `Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS`, it would help a bit
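
The comparison can be scripted. A small sketch that finds the first occurrence of each pattern in a teuthology log and prints the gap between them, assuming the leading `YYYY-MM-DDTHH:MM:SS.mmm` timestamp format seen in the excerpts above:
```
# correlate_log_events.py - sketch: locate the first occurrence of two
# patterns in a teuthology log and report the time gap between them.
import re
import sys
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)")

def first_match(path, needle):
    with open(path, errors="replace") as f:
        for line in f:
            if needle in line:
                m = TS_RE.match(line)
                if m:
                    return datetime.fromisoformat(m.group(1))
    return None

if __name__ == "__main__":
    log = sys.argv[1]  # e.g. the downloaded teuthology.log
    slow = first_match(log, "MDS_SLOW_REQUEST")
    scale = first_match(log, "Waiting for fs cephfs to scale down to reach 1 MDS")
    print("first slow request:", slow)
    print("first scale-down wait:", scale)
    if slow and scale:
        print("gap:", scale - slow)
```
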
2024-08-20T11:53:39.654Z
<Xiubo Li> yeah, sounds reasonable
2024-08-20T11:53:44.221Z
<Xiubo Li> let me have a look
2024-08-20T11:55:09.353Z
<Venky Shankar> yeh
2024-08-20T11:56:06.640Z
<Dhairya Parmar> and i found another reason
```2024-08-02T17:20:19.659 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:19 smithi081 ceph-mon[92715]: Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)```
2024-08-20T11:57:26.741Z
<Xiubo Li> The revoke happened at  `2024-08-02T17:14:23`:

```2024-08-02T17:14:23.288+0000 7f2502ad1640 10 mds.0.15 send_message_client_counted client.24283 seq 24071 client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12
2024-08-02T17:14:23.288+0000 7f2502ad1640  1 -- [v2:172.21.15.81:6828/1710115680,v1:172.21.15.81:6829/1710115680] --> v1:172.21.15.81:0/3322994407 -- client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12 -- 0x560e59b78380 con 0x560e530e4000```
And then it got stuck.

Then the `fs` scale down happened 6 minutes later:

```2024-08-02T17:20:21.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: stopping daemon mds.cephfs.smithi110.ttqwpb
...
2024-08-02T17:20:21.660 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS```
2024-08-20T11:57:42.078Z
<Xiubo Li> No
2024-08-20T11:57:56.882Z
<Dhairya Parmar> so this means the request stalled the MDS?
2024-08-20T11:58:12.642Z
<Xiubo Li> I think so, but I need to confirm
2024-08-20T11:58:20.920Z
<Venky Shankar> The last run essentially had a bug in mon_thrash task that caused most fs:thrash to not run in its entirety.
2024-08-20T11:58:50.865Z
<Venky Shankar> And this branch has that fix in place, so we probably missed this earlier..
2024-08-20T11:59:00.200Z
<Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:08.158 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:07 smithi081 ceph-mon[92715]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)```
2024-08-20T11:59:02.107Z
<Venky Shankar> anyway, I'll check and create tracker.
2024-08-20T11:59:14.310Z
<Venky Shankar> works just fine for me
2024-08-20T11:59:20.629Z
<Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: 1 osds down (OSD_DOWN)
2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)```
2024-08-20T11:59:48.702Z
<Dhairya Parmar> IMO the `require_osd_release` needs to be updated
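
For context, that flag can be inspected with `ceph osd dump` and, outside of an orchestrated upgrade, bumped with `ceph osd require-osd-release <release>`. A hedged sketch for inspection only; during a cephadm upgrade the orchestrator is normally expected to raise the flag itself once all OSDs are done:
```
# check_require_osd_release.py - sketch: read require_osd_release from
# `ceph osd dump` output. For inspection only; during a cephadm upgrade
# the orchestrator normally raises this flag itself at the end.
import json
import subprocess

def require_osd_release() -> str:
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)["require_osd_release"]

if __name__ == "__main__":
    current = require_osd_release()
    print(f"require_osd_release = {current}")
    if current != "squid":
        # the manual bump, if it really is warranted:
        print("to raise it: ceph osd require-osd-release squid")
```
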
2024-08-20T12:02:59.469Z
<Xiubo Li> Okay, let me check
2024-08-20T12:03:01.431Z
<Xiubo Li> ```2024-08-02T17:32:32.519+0000 7f2879e1d640  5 mds.1.7 shutdown_pass=false
2024-08-02T17:32:32.519+0000 7f2879e1d640 20 mds.beacon.cephfs.smithi110.ttqwpb 1 slow request found
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 get_task_status
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 schedule_update_timer_task
2024-08-02T17:32:32.619+0000 7f287961c640  5 mds.beacon.cephfs.smithi110.ttqwpb Sending beacon up:stopping seq 308```
2024-08-20T12:03:20.066Z
<Xiubo Li> It was the slow request that blocked the scaling down
2024-08-20T12:03:45.616Z
<Dhairya Parmar> so the `require_osd_release`  is not a blocker right
2024-08-20T12:03:51.360Z
<Dhairya Parmar> which took the osd down
2024-08-20T12:05:18.483Z
<Xiubo Li> The cap revoke got stuck and then the slow request happened. The cap revoke was caused by the `Fwb` caps, which are waiting for the data to be flushed to RADOS.
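
For readers unfamiliar with the cap-string notation used in this thread (`pAsxLsXsxFscr`, `Fwb`), a small decoder following the usual CephFS convention; the letter meanings here are an informal reading aid, not taken from this thread:
```
# cap_string.py - rough decoder for MDS cap strings like "pAsxLsXsxFscr".
# Letter meanings follow the usual CephFS convention: p=pin, A=auth,
# L=link, X=xattr, F=file; s=shared, x=excl, r=read, w=write,
# c=cache, b=buffer, l=lazyio. So "Fwb" reads as buffered file writes.
CATEGORIES = {"A": "auth", "L": "link", "X": "xattr", "F": "file"}
SUBCAPS = {"s": "shared", "x": "excl", "r": "read", "w": "write",
           "c": "cache", "b": "buffer", "l": "lazyio"}

def decode(caps):
    out, current = [], None
    for ch in caps:
        if ch == "p":
            out.append("pin")
        elif ch in CATEGORIES:
            current = CATEGORIES[ch]
        elif current and ch in SUBCAPS:
            out.append(f"{current}-{SUBCAPS[ch]}")
    return out

if __name__ == "__main__":
    print(decode("Fwb"))            # ['file-write', 'file-buffer']
    print(decode("pAsxLsXsxFscr"))  # pin, auth-*, link-*, xattr-*, file-*
```
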
2024-08-20T12:05:37.265Z
<Xiubo Li> So I need to confirm whether the osd down issue caused the IO blocking
2024-08-20T12:07:54.522Z
<Xiubo Li> sure
2024-08-20T12:12:27.371Z
<Dhairya Parmar> @Xiubo Li the slow request warning is only seen after OSD upgrade starts:
```2024-08-02T17:17:19.543 INFO:teuthology.orchestra.run.smithi081.stderr:2024-08-02T17:17:19.542+0000 7f60deffd640  1 -- 172.21.15.81:0/2869246966 <== mgr.34104 v2:172.21.15.81:6800/3996574518 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+400 (secure 0 0 0) 0x7f60e802c450 con 0x7f60d406e2b0
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "in_progress": true,
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "which": "Upgrading all daemon types on all hosts",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "services_complete": [
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "crash",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "mgr",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:        "mon"
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    ],
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "progress": "6/23 daemons upgraded",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "message": "Currently upgrading osd daemons",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:    "is_paused": false
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:}```
2024-08-20T12:14:14.172Z
<Rishabh Dave> okay, i had tried running it a couple of weeks ago and it wasn't working for me. IIRC vstart.sh failed every time i tried. i'll try again...
2024-08-20T12:14:21.339Z
<Xiubo Li> Yeah, so I just suspect the osd upgrade blocked the IOs, i.e. the dirty data writeback
2024-08-20T12:14:21.414Z
<Dhairya Parmar> So as soon as the upgrade started, the client reqs couldn't be flushed
2024-08-20T12:15:11.083Z
<Xiubo Li> it should be that the dirty data flushing from the buffer is blocked
2024-08-20T12:15:11.900Z
<Dhairya Parmar> before the first slow req warning i see this
```2024-08-02T17:17:27.946 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:27 smithi081 ceph-mon[92715]: pgmap v43: 65 pgs: 34 active+undersized+degraded, 31 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 0 B/s rd, 3 op/s; 5168/34245 objects degraded (15.091%)```
2024-08-20T12:15:46.415Z
<Dhairya Parmar> > it should be the dirty data flushing from buffer is blocked
exactly! the objects are degraded
2024-08-20T12:16:13.419Z
<Dhairya Parmar> see this
```2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check failed: Degraded data redundancy: 5168/34245 objects degraded (15.091%), 34 pgs degraded (PG_DEGRADED)
2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering)
2024-08-02T17:17:28.967 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:28 smithi081 ceph-mon[92715]: Health check failed: 2 MDSs report slow requests (MDS_SLOW_REQUEST)```
2024-08-20T12:17:30.464Z
<Dhairya Parmar> and the degradation is because an osd had died
```2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Got signal Terminated ***
2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: osd.0 marked itself down and dead
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: osd.0 marked itself down and dead
2024-08-02T17:17:22.382 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:22 smithi081 podman[101253]: 2024-08-02 17:17:22.069729155 +0000 UTC m=+1.005376791 container died 6d71c06ca77f31c078d37dcfc7db45a2f9b4ffd8ea6eacddd8e36932d96cb6ac (image=quay.ceph.io/ceph-ci/ceph@sha256:874ad160b08ea56a94a9c10c9da918eee8eec002405aef1c3b4a5423f6209448, name=ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0, io.buildah.version=1.36.0, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, GIT_BRANCH=HEAD, ceph=True, RELEASE=reef-a3bbd72, org.label-schema.build-date=20240716, org.label-schema.schema-version=1.0, GIT_REPO=git@github.com:ceph/ceph-container.git, GIT_CLEAN=True, CEPH_POINT_RELEASE=, GIT_COMMIT=c5aaba5e3282b30e4782f2b5d6e4e362e22dfcb7, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.name=CentOS Stream 9 Base Image)
2024-08-02T17:17:22.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:22 smithi081 ceph-mon[92715]: Health check failed: 1 osds down (OSD_DOWN)```
2024-08-20T12:17:39.577Z
<Xiubo Li> Hmm, yeah, this should block the IOs
2024-08-20T12:18:24.505Z
<Dhairya Parmar> osd.0 marked itself down and dead
2024-08-20T12:18:38.913Z
<Xiubo Li> it recovered later
2024-08-20T12:18:47.606Z
<Dhairya Parmar> yea
2024-08-20T12:18:56.429Z
<Xiubo Li> But the osd upgrade status stayed in progress and lasted 3 hours
2024-08-20T12:18:57.675Z
<Dhairya Parmar> ```2024-08-02T17:17:50.728 INFO:teuthology.orchestra.run.smithi081.stdout:osd.0                        smithi081                    running (24s)     19s ago   9m    12.1M```
2024-08-20T12:19:23.430Z
<Xiubo Li> though the osd daemons recovered, the IOs didn't
2024-08-20T12:19:32.750Z
<Dhairya Parmar> IOs didn't get flushed
2024-08-20T12:19:43.837Z
<Dhairya Parmar> is this the same as some waiter not waking up?
2024-08-20T12:19:55.858Z
<Xiubo Li> Or they were already flushed but the osd didn't reply and got stuck
2024-08-20T12:20:33.744Z
<Xiubo Li> It shouldn't be the same issue, this one is a kclient
2024-08-20T12:21:22.320Z
<Dhairya Parmar> okay,
2024-08-20T12:31:26.133Z
<Igor Golikov> do we have the daily meeting right now?
2024-08-20T12:33:52.195Z
<jcollin> @Igor Golikov <https://meet.jit.si/cephfs-standup>
2024-08-20T19:52:37.910Z
<reid.guyett> Are there any built-in ops/bw limiters in cephfs?
