ceph - cephfs - 2024-10-14

Timestamp (UTC) | Message
2024-10-14T06:09:11.244Z
<Igor Golikov> Hi folks, I opened my first PR ( <https://github.com/ceph/ceph/pull/60286> ) and some tests are failing, e.g. `make check`, with a weird message:
```ninja: error: remove(../src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: error: remove(/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: build stopped: interrupted by user.
ninja: error: remove(../src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: error: remove(/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
Build step 'Execute shell' marked build as failure
[Cobertura] Skipping Cobertura coverage report as build was not SUCCESS or better ...```
I ran `run-make-check.sh` on my local machine and all tests passed. From the documentation it looks like `make check` is essentially `run-make-check.sh`. Any hints?
2024-10-14T12:05:07.291Z
<John> Hi,

I am running a CephFS filesystem on Kubernetes orchestrated using Rook and I am coming up against an issue
that I am struggling to debug.

The CephFS is running on PVCs, currently using the EBS CSI with IO2 block devices that have 6000 IOPS provisioned.

The cluster has come up healthy, and the file system is successfully being mounted, written to, and read
from by workloads. The workloads are K8s Pods related to the GitHub Actions Runner Controller. In our case, the containers
that run in workflows are spun up in separate pods, which Kubernetes schedules to a node in the cluster (sometimes the
workflow pod is scheduled to the same node as the runner pod). To pass files between the runner and the workflow pods,
we are using a CephFS volume mounted with `ReadWriteMany` so that both pods can write to it (this is required).

This is where the issue comes in: in one specific case we are seeing a problem reading a file from the shared volume. The
runner (written in C#) writes the event context to a JSON file, which the checkout action (written in Node.js) then reads
and parses. The parsing fails with this error:

```undefined:1

SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at new Context (/__w/_actions/actions/checkout/1d96c772d19495a3b5c517cd2bc0cb401ea0529f/dist/index.js:4787:37)
    ...```
[The C# code that writes that file](https://github.com/actions/runner/blob/9b3b554758c26bf2caa8121278ec94a6f0d97de5/src/Runner.Worker/ExecutionContext.cs#L1176) uses `File.WriteAllText(...)`, which closes the file when it completes and
should throw an exception if closing the file does not succeed.
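For comparison, a Node.js write path that explicitly flushes data to stable storage before closing would look roughly like the
sketch below (illustration only; the real writer is the C# code linked above, and I have no evidence that an explicit fsync
changes the behaviour):

```
// Hypothetical writer: write, fsync, then close, so the file data is flushed
// before the descriptor is released. This is NOT the runner's actual code.
const { openSync, writeSync, fsyncSync, closeSync } = require('fs');

function writeEventFile(path, payload) {
    const fd = openSync(path, 'w');
    try {
        writeSync(fd, JSON.stringify(payload));
        fsyncSync(fd); // flush file data before closing
    } finally {
        closeSync(fd);
    }
}
```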

There are two interesting things here. Firstly, if the reading pod is scheduled to the same node as the writing pod, the
error reading from the shared volume does not occur. Secondly, I forked the checkout action to try to debug what
might be going on, and I tried to log the contents of the JSON file by updating the code to this:

```// Read and check the content of the file
const data = (0, fs_1.readFileSync)(process.env.GITHUB_EVENT_PATH, { encoding: 'utf8' });
console.log(data);

// Read the contents of the file from the volume again and parse it
this.payload = JSON.parse((0, fs_1.readFileSync)(process.env.GITHUB_EVENT_PATH, { encoding: 'utf8' }));```
Reading the file twice and only attempting to parse the second read succeeds, and the error from before is no longer
present. The logged data appears to be correct, but if I attempt to parse the first read rather than just log it, it fails.
This double read is not required on another shared filesystem: originally we had been using AWS EFS, but it was
very slow, nearly 10 times slower for large checkouts.
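For reference, the double-read workaround could be made a little more defensive along these lines (a sketch only, with
hypothetical names; it retries until the JSON parses instead of relying on exactly two reads):

```
// Hypothetical retry wrapper around the workaround in the forked checkout action.
const { readFileSync } = require('fs');

function readJsonWithRetry(path, attempts = 5) {
    let lastErr;
    for (let i = 0; i < attempts; i++) {
        const data = readFileSync(path, { encoding: 'utf8' });
        try {
            return JSON.parse(data);
        } catch (err) {
            lastErr = err;
            // a real version would probably also sleep between attempts
            console.log(`attempt ${i + 1}: read ${data.length} chars, parse failed`);
        }
    }
    throw lastErr;
}

// usage, mirroring the forked action:
// this.payload = readJsonWithRetry(process.env.GITHUB_EVENT_PATH);
```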

We have tried:
• We were originally using GP3 block devices in the storage class for the PVCs; we changed to IO2 to test higher IOPS, which didn't help.
• I mounted the volumes with various mount options (`read_from_replica=localize`, `noasyncreaddir`, `sync`); none of them helped.
• I tried running `sync /path/to/file` in a loop from a sidecar container in the writer's pod to try to diagnose the issue; that didn't help either.
I am not sure why other files written to the shared volume can be read fine while this particular one consistently causes
issues. We are not the first to encounter this issue using the Actions Runner Controller with Rook and CephFS; someone else
ran into it and, as far as I can tell, did not resolve it either: <https://github.com/actions/runner-container-hooks/issues/145>.
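One further diagnostic that could go into the forked checkout action (a sketch with hypothetical names, assuming the failed
parse corresponds to a short or empty read) is to compare the size reported by `stat` with the number of bytes actually read:

```
// Hypothetical check for short/empty reads of the event file.
const { readFileSync, statSync } = require('fs');

function checkEventFile(path) {
    const size = statSync(path).size;                      // size the client thinks the file has
    const data = readFileSync(path, { encoding: 'utf8' }); // what we actually got back
    const bytesRead = Buffer.byteLength(data, 'utf8');
    console.log(`stat size=${size} bytes, read=${bytesRead} bytes`);
    if (bytesRead === 0 || bytesRead < size) {
        console.log('short or empty read: file size and file data appear out of sync');
    }
}
```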

Any advice to help debug this issue would be greatly appreciated!
2024-10-14T12:05:52.263Z
<John> I would also note here that we have already had a chat about this in the Rook Slack, where some helpful points were raised (including some of the tests above), but ultimately no fix has been found. They recommended posting to the Ceph Slack to see if there was any other information, ideas or suggestions that could help in debugging this issue.

Related rook discussion: <https://rook-io.slack.com/archives/CG3HUV94J/p1728460466450799>
2024-10-14T14:51:12.964Z
<gregsfortytwo> do you see what the difference is between the first and second read?
We’ve had a similar report, but I can’t remember if it’s been resolved or if it’s just another one I’ve been confusing with this.
