2024-08-20T06:57:03.802Z | <Igor Golikov> Hi team, excited to join cephfs! looking forward to working with you all! |
2024-08-20T06:58:03.879Z | <Venky Shankar> Welcome @Igor Golikov |
2024-08-20T06:59:02.460Z | <Xiubo Li> Welcome! |
2024-08-20T07:19:11.210Z | <jcollin> Hi @Igor Golikov, Welcome! |
2024-08-20T07:45:17.893Z | <Dhairya Parmar> Hey @Igor Golikov, welcome to the team 🙂 |
2024-08-20T09:40:04.017Z | <Rishabh Dave> Welcome! :) |
2024-08-20T10:12:23.928Z | <Venky Shankar> @Xiubo Li around? |
2024-08-20T10:12:44.495Z | <Venky Shankar> Could you check <https://pulpito.ceph.com/vshankar-2024-08-14_07:23:44-fs-wip-vshankar-testing-20240814.051955-debug-testing-default-smithi/7854722/> when you have some time? |
2024-08-20T10:14:03.479Z | <Venky Shankar> This is the stock kernel in centos9 stream and this failure resembles <https://tracker.ceph.com/issues/48640> for which the kernel fix <https://patchwork.kernel.org/project/ceph-devel/patch/20210106014726.77614-1-xiubli@redhat.com/> should have been ported I think. |
2024-08-20T10:14:19.225Z | <Venky Shankar> So, either the fix isn't in the stock kernel or this is a new failure. |
2024-08-20T10:31:56.081Z | <Xiubo Li> Hi Venky, checking |
2024-08-20T10:44:43.576Z | <Xiubo Li> ```2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:diff --git a/debian/ceph-mds.postinst b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:index dfe02d2308e..e69de29bb2d 100644
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:--- a/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:+++ b/debian/ceph-mds.postinst
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:@@ -1,42 +0,0 @@
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#!/bin/sh
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# postinst script for ceph-mds
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-# see: dh_installdeb(1)
2024-08-16T12:54:57.928 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-set -e
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# summary of how this script can be called:
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# postinst configure <most-recently-configured-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# old-postinst abort-upgrade <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# conflictor's-postinst abort-remove in-favour <package> <new-version>
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# postinst abort-remove
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# deconfigured's-postinst abort-deconfigure in-favour <failed-install-package> <version> [<removing conflicting-package> <version>]
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-#
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# for details, see <http://www.debian.org/doc/debian-policy/> or
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-# the debian-policy package
2024-08-16T12:54:57.929 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-case "$1" in
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- configure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- start ceph-mds-all || :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- abort-upgrade|abort-remove|abort-deconfigure)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- :
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- ;;
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- *)
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- echo "postinst called with unknown argument \`$1'" >&2
2024-08-16T12:54:57.930 INFO:tasks.workunit.client.1.smithi179.stdout:- exit 1
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:- ;;
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-esac
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# dh_installdeb will replace this with shell code automatically
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-# generated by other debhelper scripts.
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-#DEBHELPER#
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-exit 0
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.931 INFO:tasks.workunit.client.1.smithi179.stdout:-
2024-08-16T12:54:57.932 DEBUG:teuthology.orchestra.run:got remote process result: 1``` |
2024-08-20T10:46:13.353Z | <Xiubo Li> The stock kernel already includes that kernel patch, so this should be a new issue. |
2024-08-20T10:46:40.997Z | <Xiubo Li> Please create a new tracker and assign it to me. I will have a look this week or later. |
2024-08-20T11:01:34.231Z | <Igor Golikov> Hey team, for some reason I have been assigned a MacBook Pro, and I realize that there is no way to build Ceph on macOS natively. Does anyone here use an MBP, with any virtualization SW to run Linux on it? |
2024-08-20T11:03:36.823Z | <Rishabh Dave> i don't think anyone uses an MBP now, but (IIRC) a previous team member had found/built a way to do so. i'm unaware of the details though. |
2024-08-20T11:07:58.517Z | <Dhairya Parmar> Building ceph is stuck at `Performing download step (download, verify and extract) for 'Boost'` for like half an hour now. Anyone got any idea? |
2024-08-20T11:09:00.446Z | <Xiubo Li> It seems stuck downloading the dependent repos. |
2024-08-20T11:10:03.305Z | <Dhairya Parmar> @Rishabh Dave once told me that if i have the required boost tar then it should be done, which i think i have
```dparmar:src$ ls
Boost boost_1_82_0.tar.bz2 boost_1_85_0.tar.bz2 Boost-build Boost-stamp ex-Boost1234```
😕 |
2024-08-20T11:10:59.820Z | <Rishabh Dave> can you copy the output of `ls -lh` for the same dir? |
2024-08-20T11:11:06.042Z | <Dhairya Parmar> time to run fedora in MBP XD |
2024-08-20T11:11:14.785Z | <Rishabh Dave> it'll tell us if the copied tar is of the correct size. |
2024-08-20T11:11:36.746Z | <Rishabh Dave> i'll compare it with tar.bz2 files in my work repo. |
2024-08-20T11:11:38.350Z | <Xiubo Li> You can try removing them and trying again. I've hit similar issues several times before; I resolved them by downloading the tarballs manually. Sometimes it works after deleting the incomplete repo and retrying. |
2024-08-20T11:11:57.670Z | <Dhairya Parmar> okay |
2024-08-20T11:12:05.508Z | <Dhairya Parmar> so it does seem like a pretty slow download |
2024-08-20T11:12:11.176Z | <Dhairya Parmar> ```-rw-r--r--. 1 dparmar dparmar 7.1M Aug 20 16:41 boost_1_85_0.tar.bz2``` |
2024-08-20T11:12:28.786Z | <Rishabh Dave> file size is too small |
2024-08-20T11:12:42.251Z | <Rishabh Dave> i guess you didn't cancel the ninja command before copying... |
2024-08-20T11:13:00.832Z | <Rishabh Dave> and therefore ninja ended up overwriting the tar file |
2024-08-20T11:13:09.098Z | <Xiubo Li> The repo server may also be slow for some reason; just retry after removing the incomplete repo. |
2024-08-20T11:13:30.757Z | <Dhairya Parmar> okay finally...... |
2024-08-20T11:13:49.503Z | <Dhairya Parmar> it has moved ... |
2024-08-20T11:14:02.927Z | <Rishabh Dave> nice. |
2024-08-20T11:14:32.373Z | <Dhairya Parmar> lets host a server for this, i'll contribute some bucks XD |
2024-08-20T11:15:18.004Z | <Rishabh Dave> usually i cancel the `ninja` command, copy the boost tar, and then run the `ninja` command again. |
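A minimal sketch of that retry workflow, assuming the staging directory is the one from the `ls` output above and using a placeholder download URL (both are assumptions, not verified paths):
```
# 1. Stop the running build first (Ctrl-C the `ninja` process) so it cannot
#    overwrite the tarball while it is being copied.
# 2. Fetch the tarball manually and sanity-check its size (it should be on the
#    order of 100 MB, not a few MB like the truncated one above).
wget -O boost_1_85_0.tar.bz2 '<mirror-url-for-boost_1_85_0.tar.bz2>'   # placeholder URL
ls -lh boost_1_85_0.tar.bz2
# 3. Drop it where the Boost download step stages tarballs (assumed to be the
#    directory shown in the `ls` above), then re-run the build.
cp boost_1_85_0.tar.bz2 /path/to/ceph/build/boost/src/                 # assumed layout
ninja
```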
2024-08-20T11:15:40.508Z | <Dhairya Parmar> wow |
2024-08-20T11:16:09.043Z | <Xiubo Li> ```2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T20:13:40.226 INFO:teuthology.orchestra.run.smithi081.stdout: "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "in_progress": true,
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "which": "Upgrading all daemon types on all hosts",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "services_complete": [
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "crash",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "osd",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "mgr",
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: "mon"
2024-08-02T20:13:40.227 INFO:teuthology.orchestra.run.smithi081.stdout: ],
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout: "progress": "12/23 daemons upgraded",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout: "message": "Currently upgrading osd daemons",
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout: "is_paused": false
2024-08-02T20:13:40.228 INFO:teuthology.orchestra.run.smithi081.stdout:}```
Does anyone know whether this means the osd daemons' upgrade is stuck? This status lasted for around 3 hours with no progress.
This is possibly blocking the IOs. |
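For reference, the JSON above is the output of `ceph orch upgrade status`; a few standard orchestrator commands that could help tell whether the upgrade is progressing or wedged:
```
ceph orch upgrade status            # same JSON as above
ceph -s                             # overall health plus the upgrade progress event
ceph orch ps --daemon-type osd      # which OSDs are still on the old image/version
ceph health detail                  # any blocking warnings (OSD_DOWN, MDS_SLOW_REQUEST, ...)
```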
2024-08-20T11:16:30.013Z | <Dhairya Parmar> Anything in OSD logs? |
2024-08-20T11:16:58.471Z | <Xiubo Li> No useful logs found |
2024-08-20T11:17:34.079Z | <Dhairya Parmar> I had seen this before, but for mgr not osd. trying to recall it. |
2024-08-20T11:18:12.299Z | <Xiubo Li> This is one qa failure |
2024-08-20T11:18:28.287Z | <Xiubo Li> After 3 hours it seems the blocked IO finally got a reply |
2024-08-20T11:18:30.132Z | <Igor Golikov> what is XD? I am not familiar with it 🙂 |
2024-08-20T11:18:38.758Z | <Dhairya Parmar> this is the upgrade suite, right? |
2024-08-20T11:18:44.871Z | <Xiubo Li> correct |
2024-08-20T11:18:50.466Z | <Igor Golikov> XD |
2024-08-20T11:18:52.866Z | <Rishabh Dave> XD = 😄 |
2024-08-20T11:18:53.371Z | <Igor Golikov> XD |
2024-08-20T11:19:00.388Z | <Igor Golikov> kidding 🙂 i know |
2024-08-20T11:19:09.297Z | <Rishabh Dave> ok. XD |
2024-08-20T11:19:33.701Z | <Igor Golikov> well the bottom line is - no way to run it without 3rd party SW ... Parallels or whatever |
2024-08-20T11:19:40.662Z | <Igor Golikov> thanks, will check it further on |
2024-08-20T11:19:45.907Z | <Xiubo Li> This will block the `Fwb` caps from being released on the client side. |
2024-08-20T11:21:01.581Z | <Rishabh Dave> i think most of us didn't try because it was not worth the effort. non-MBP works just fine. |
2024-08-20T11:21:48.804Z | <Dhairya Parmar> is this seen regularly? |
2024-08-20T11:25:57.326Z | <Xiubo Li> No, this is the first time I've seen it |
2024-08-20T11:26:22.377Z | <Xiubo Li> This tracker <https://tracker.ceph.com/issues/67518> |
2024-08-20T11:31:07.840Z | <Igor Golikov> that's correct, i just don't know why they gave me an MBP 🙂 |
2024-08-20T11:34:19.859Z | <Rishabh Dave> the IT folks gave me a choice between a thinkpad and an MBP, IIRC. they don't usually know what other team members are using. |
2024-08-20T11:38:04.503Z | <Igor Golikov> they called me to ask, but I had no idea why not to choose an MBP (at VMware we got only MBPs with Fusion, so you can run any type of VM on it) |
2024-08-20T11:39:43.289Z | <Venky Shankar> @Igor Golikov Once you have ceph lab access, there is a dedicated node (vossi01) for cephfs devs to build/run ceph clusters (using vstart.sh). |
2024-08-20T11:40:38.120Z | <Dhairya Parmar> do you have cephadm logs? |
2024-08-20T11:41:00.667Z | <Venky Shankar> Any recent patches in the testing kernel? |
2024-08-20T11:41:49.995Z | <Dhairya Parmar> right before the upgrade logs i also see this
```2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:20:31.446 INFO:teuthology.orchestra.run.smithi081.stdout: "mon": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "mgr": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 2
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "osd": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 6
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "mds": {
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: },
2024-08-02T17:20:31.447 INFO:teuthology.orchestra.run.smithi081.stdout: "overall": {
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 18.2.2-1767-ga3bbd728 (a3bbd7289877bdcce87fd1f79da1a2d6578dde36) reef (stable)": 4,
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout: "ceph version 19.1.0-1260-g26c3fb8e (26c3fb8e197dcf7a49a54d1f4c8a7362ee35a8ea) squid (rc)": 10
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout: }
2024-08-02T17:20:31.448 INFO:teuthology.orchestra.run.smithi081.stdout:}``` |
2024-08-20T11:42:40.533Z | <Dhairya Parmar> the osds are at `19.1.0-1260-g26c3fb8e`, same with other daemons too apart from mds being at `18.2.2-1767-ga3bbd728` |
2024-08-20T11:43:52.903Z | <Dhairya Parmar> @Xiubo Li i think i found something interesting `2024-08-02T17:20:31.824 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:31 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS` |
2024-08-20T11:44:25.105Z | <Dhairya Parmar> and this is the code
``` if not self.upgrade_state.fail_fs:
     if not (mdsmap['in'] == [0] and len(mdsmap['up']) <= 1):
         self.mgr.log.info(
             'Upgrade: Waiting for fs %s to scale down to reach 1 MDS' % (
                 fs_name))
         time.sleep(10)
         continue_upgrade = False
         continue``` |
2024-08-20T11:44:37.649Z | <Dhairya Parmar> it looks like it was stuck here continuously |
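To see what cephadm is waiting on here, one could inspect the same MDS map fields the code above checks (`in` and `up`); a small sketch:
```
ceph fs status cephfs                  # human-readable rank/state view
ceph fs get cephfs -f json-pretty      # full MDS map; check the "in" and "up" fields the upgrade loop tests
```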
2024-08-20T11:45:46.111Z | <jcollin> Beware of python version changes in a shared environment. I'd like to use a container in such cases. |
2024-08-20T11:45:52.883Z | <Igor Golikov> got it. but don't you have any sandbox version of the cluster to run locally for tests/debugging? |
2024-08-20T11:46:59.785Z | <jcollin> vstart.sh/mstart.sh |
2024-08-20T11:48:27.015Z | <jcollin> <https://docs.ceph.com/en/quincy/dev/dev_cluster_deployement/> |
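A minimal example of spinning up such a sandbox cluster with vstart.sh from the build directory (following the dev guide linked above):
```
cd ceph/build
MON=1 OSD=3 MDS=1 ../src/vstart.sh -d -n -x    # -n new cluster, -d debug, -x enable cephx
./bin/ceph -s                                  # talk to the local dev cluster
../src/stop.sh                                 # tear it down when done
```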
2024-08-20T11:51:11.901Z | <Xiubo Li> The cephadm logs should be in `remote/smithi081/log/cephadm.log.gz` |
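A quick way to skim that archived log without unpacking it (standard gzip-aware tools):
```
zless remote/smithi081/log/cephadm.log.gz
zgrep -Ei 'upgrade|error' remote/smithi081/log/cephadm.log.gz | less
```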
2024-08-20T11:51:21.651Z | <Rishabh Dave> does `vstart_runner.py` run fine on vossi machines? last time i tried, it didn't work for me. |
2024-08-20T11:51:57.710Z | <Xiubo Li> It seems the slow request just blocks scaling down the fs? |
2024-08-20T11:52:33.250Z | <Dhairya Parmar> > It seems the slow request just blocks scaling down the fs?
i'm just wondering if that is the cause or the symptom |
2024-08-20T11:52:53.359Z | <Xiubo Li> Do you mean the snap-related one? |
2024-08-20T11:53:20.482Z | <Dhairya Parmar> If we can compare the timestamp when the first slow request occurred and the time when we see `Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS` then it would help a bit |
2024-08-20T11:53:39.654Z | <Xiubo Li> yeah, sounds reasonable |
2024-08-20T11:53:44.221Z | <Xiubo Li> let me have a look |
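A rough way to line up the two events, assuming the run's logs have been pulled down locally (the file name is an assumption based on the usual teuthology archive layout):
```
grep -m1 'MDS_SLOW_REQUEST' teuthology.log                       # first slow-request health warning
grep -m1 'Waiting for fs cephfs to scale down' teuthology.log    # first scale-down wait message
```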
2024-08-20T11:55:09.353Z | <Venky Shankar> yeh |
2024-08-20T11:56:06.640Z | <Dhairya Parmar> and i found another reason
```2024-08-02T17:20:19.659 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:19 smithi081 ceph-mon[92715]: Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)``` |
2024-08-20T11:57:26.741Z | <Xiubo Li> The revoke happened at `2024-08-02T17:14:23`:
```2024-08-02T17:14:23.288+0000 7f2502ad1640 10 mds.0.15 send_message_client_counted client.24283 seq 24071 client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12
2024-08-02T17:14:23.288+0000 7f2502ad1640 1 -- [v2:172.21.15.81:6828/1710115680,v1:172.21.15.81:6829/1710115680] --> v1:172.21.15.81:0/3322994407 -- client_caps(revoke ino 0x100000064c5 1 seq 6 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFxcwb follows 0 size 1014371/4194304 ts 1/18446744073709551615 mtime 2024-08-02T17:14:21.659605+0000 ctime 2024-08-02T17:14:22.368591+0000 change_attr 2) v12 -- 0x560e59b78380 con 0x560e530e4000```
And then it got stuck.
Then the `fs` scale down happened 6 minutes later:
```2024-08-02T17:20:21.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: stopping daemon mds.cephfs.smithi110.ttqwpb
...
2024-08-02T17:20:21.660 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:21 smithi081 ceph-mon[92715]: Upgrade: Waiting for fs cephfs to scale down to reach 1 MDS``` |
2024-08-20T11:57:42.078Z | <Xiubo Li> No |
2024-08-20T11:57:56.882Z | <Dhairya Parmar> so this means the request stalled the MDS? |
2024-08-20T11:58:12.642Z | <Xiubo Li> I think so and need to confirm |
2024-08-20T11:58:20.920Z | <Venky Shankar> The last run essentially had a bug in the mon_thrash task that caused most fs:thrash jobs to not run in their entirety. |
2024-08-20T11:58:50.865Z | <Venky Shankar> And this branch has that fix in place, so we probably missed this earlier.. |
2024-08-20T11:59:00.200Z | <Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:08.158 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:20:07 smithi081 ceph-mon[92715]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)``` |
2024-08-20T11:59:02.107Z | <Venky Shankar> anyway, I'll check and create tracker. |
2024-08-20T11:59:14.310Z | <Venky Shankar> works just fine for me |
2024-08-20T11:59:20.629Z | <Dhairya Parmar> @Xiubo Li
```2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: 1 osds down (OSD_DOWN)
2024-08-02T17:20:07.769 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:20:07 smithi110 ceph-mon[81030]: Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)``` |
2024-08-20T11:59:48.702Z | <Dhairya Parmar> IMO the `require_osd_release` needs to be updated |
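For what it's worth, the flag that warning refers to can be checked and, once all OSDs really are on squid, bumped by hand (cephadm normally does this itself at the end of an upgrade):
```
ceph osd dump | grep require_osd_release     # current value
ceph osd require-osd-release squid           # only once every OSD runs squid
```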
2024-08-20T12:02:59.469Z | <Xiubo Li> Okay, let me check |
2024-08-20T12:03:01.431Z | <Xiubo Li> ```2024-08-02T17:32:32.519+0000 7f2879e1d640 5 mds.1.7 shutdown_pass=false
2024-08-02T17:32:32.519+0000 7f2879e1d640 20 mds.beacon.cephfs.smithi110.ttqwpb 1 slow request found
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 get_task_status
2024-08-02T17:32:32.589+0000 7f2879e1d640 20 mds.1.7 schedule_update_timer_task
2024-08-02T17:32:32.619+0000 7f287961c640 5 mds.beacon.cephfs.smithi110.ttqwpb Sending beacon up:stopping seq 308``` |
2024-08-20T12:03:20.066Z | <Xiubo Li> It was the slow request that blocked the scaling down |
2024-08-20T12:03:45.616Z | <Dhairya Parmar> so the `require_osd_release` is not a blocker, right? |
2024-08-20T12:03:51.360Z | <Dhairya Parmar> which turned down the osd |
2024-08-20T12:05:18.483Z | <Xiubo Li> The cap revoke got stuck and then the slow request happened. The cap revoke was caused by the `Fwb` caps, which are waiting for the data to be flushed to RADOS. |
2024-08-20T12:05:37.265Z | <Xiubo Li> So I need to confirm whether the osd down issue caused the IO blocking |
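One way to check that, assuming access to the OSD host (with cephadm these go through the daemon's admin socket, e.g. via `cephadm shell`):
```
ceph health detail                        # slow ops / blocked requests summary
ceph daemon osd.0 dump_ops_in_flight      # in-flight ops on osd.0 (example id)
ceph daemon osd.0 dump_historic_ops       # recently completed (including slow) ops
```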
2024-08-20T12:07:54.522Z | <Xiubo Li> sure |
2024-08-20T12:12:27.371Z | <Dhairya Parmar> @Xiubo Li the slow request warning is only seen after OSD upgrade starts:
```2024-08-02T17:17:19.543 INFO:teuthology.orchestra.run.smithi081.stderr:2024-08-02T17:17:19.542+0000 7f60deffd640 1 -- 172.21.15.81:0/2869246966 <== mgr.34104 v2:172.21.15.81:6800/3996574518 1 ==== mgr_command_reply(tid 0: 0 ) v1 ==== 8+0+400 (secure 0 0 0) 0x7f60e802c450 con 0x7f60d406e2b0
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:{
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout:    "target_image": "quay.ceph.io/ceph-ci/ceph@sha256:6365e5c9c60465c6a86a1efc2ae80339db907d187de8bd717e7c6952210feca6",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "in_progress": true,
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "which": "Upgrading all daemon types on all hosts",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "services_complete": [
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "crash",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "mgr",
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: "mon"
2024-08-02T17:17:19.544 INFO:teuthology.orchestra.run.smithi081.stdout: ],
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout: "progress": "6/23 daemons upgraded",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout: "message": "Currently upgrading osd daemons",
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout: "is_paused": false
2024-08-02T17:17:19.545 INFO:teuthology.orchestra.run.smithi081.stdout:}``` |
2024-08-20T12:14:14.172Z | <Rishabh Dave> okay, i had tried running it a couple of weeks ago and it wasn't working for me. IIRC vstart.sh failed every time i tried. i'll try again... |
2024-08-20T12:14:21.339Z | <Xiubo Li> Yeah, so I just suspect the osd upgrading blocked the IOs, which is the dirty data write back |
2024-08-20T12:14:21.414Z | <Dhairya Parmar> So as soon as the upgrade started, the client req couldn't be flushed |
2024-08-20T12:15:11.083Z | <Xiubo Li> it should be that the dirty data flushing from the buffer is blocked |
2024-08-20T12:15:11.900Z | <Dhairya Parmar> before the first slow req warning i see this
```2024-08-02T17:17:27.946 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:27 smithi081 ceph-mon[92715]: pgmap v43: 65 pgs: 34 active+undersized+degraded, 31 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 0 B/s rd, 3 op/s; 5168/34245 objects degraded (15.091%)``` |
2024-08-20T12:15:46.415Z | <Dhairya Parmar> > it should be that the dirty data flushing from the buffer is blocked
exactly! the objects are degraded |
2024-08-20T12:16:13.419Z | <Dhairya Parmar> see this
```2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check failed: Degraded data redundancy: 5168/34245 objects degraded (15.091%), 34 pgs degraded (PG_DEGRADED)
2024-08-02T17:17:28.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:27 smithi110 ceph-mon[81030]: Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering)
2024-08-02T17:17:28.967 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:28 smithi081 ceph-mon[92715]: Health check failed: 2 MDSs report slow requests (MDS_SLOW_REQUEST)``` |
2024-08-20T12:17:30.464Z | <Dhairya Parmar> and the degradation is because an osd had died
```2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Got signal Terminated ***
2024-08-02T17:17:21.602 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0[48060]: 2024-08-02T17:17:21.171+0000 7fbf58d9a640 -1 osd.0 50 *** Immediate shutdown (osd_fast_shutdown=true) ***
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:21.908 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:21 smithi081 ceph-mon[92715]: osd.0 marked itself down and dead
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: pgmap v38: 65 pgs: 65 active+clean; 2.2 GiB data, 6.9 GiB used, 530 GiB / 536 GiB avail; 1.7 KiB/s rd, 3 op/s
2024-08-02T17:17:22.019 INFO:journalctl@ceph.mon.smithi110.smithi110.stdout:Aug 02 17:17:21 smithi110 ceph-mon[81030]: osd.0 marked itself down and dead
2024-08-02T17:17:22.382 INFO:journalctl@ceph.osd.0.smithi081.stdout:Aug 02 17:17:22 smithi081 podman[101253]: 2024-08-02 17:17:22.069729155 +0000 UTC m=+1.005376791 container died 6d71c06ca77f31c078d37dcfc7db45a2f9b4ffd8ea6eacddd8e36932d96cb6ac (image=quay.ceph.io/ceph-ci/ceph@sha256:874ad160b08ea56a94a9c10c9da918eee8eec002405aef1c3b4a5423f6209448, name=ceph-4a88f176-50f1-11ef-bcca-c7b262605968-osd-0, io.buildah.version=1.36.0, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, GIT_BRANCH=HEAD, ceph=True, RELEASE=reef-a3bbd72, org.label-schema.build-date=20240716, org.label-schema.schema-version=1.0, GIT_REPO=git@github.com:ceph/ceph-container.git, GIT_CLEAN=True, CEPH_POINT_RELEASE=, GIT_COMMIT=c5aaba5e3282b30e4782f2b5d6e4e362e22dfcb7, maintainer=Guillaume Abrioux <gabrioux@redhat.com>, org.label-schema.name=CentOS Stream 9 Base Image)
2024-08-02T17:17:22.658 INFO:journalctl@ceph.mon.smithi081.smithi081.stdout:Aug 02 17:17:22 smithi081 ceph-mon[92715]: Health check failed: 1 osds down (OSD_DOWN)``` |
2024-08-20T12:17:39.577Z | <Xiubo Li> Hmm, yeah, this should block the IOs |
2024-08-20T12:18:24.505Z | <Dhairya Parmar> osd.0 marked itself down and dead |
2024-08-20T12:18:38.913Z | <Xiubo Li> it recovered later |
2024-08-20T12:18:47.606Z | <Dhairya Parmar> yea |
2024-08-20T12:18:56.429Z | <Xiubo Li> But the osd upgrading status stayed in progress and lasted 3 hours |
2024-08-20T12:18:57.675Z | <Dhairya Parmar> ```2024-08-02T17:17:50.728 INFO:teuthology.orchestra.run.smithi081.stdout:osd.0 smithi081 running (24s) 19s ago 9m 12.1M``` |
2024-08-20T12:19:23.430Z | <Xiubo Li> though the osd daemons recovered, the IOs didn't |
2024-08-20T12:19:32.750Z | <Dhairya Parmar> IOs didn't get flushed |
2024-08-20T12:19:43.837Z | <Dhairya Parmar> is this the same as some waiter not waking up? |
2024-08-20T12:19:55.858Z | <Xiubo Li> Or they were already flushed but the osd didn't reply and got stuck |
2024-08-20T12:20:33.744Z | <Xiubo Li> It shouldn't be the same issue, this one is a kclient |
2024-08-20T12:21:22.320Z | <Dhairya Parmar> okay, |
2024-08-20T12:31:26.133Z | <Igor Golikov> do we have the daily meeting right now? |
2024-08-20T12:33:52.195Z | <jcollin> @Igor Golikov <https://meet.jit.si/cephfs-standup> |
2024-08-20T19:52:37.910Z | <reid.guyett> Are there any built-in ops/bw limiters in cephfs? |