ceph - cephfs - 2024-10-14

Timestamp (UTC) | Message
2024-10-14T06:09:11.244Z
<Igor Golikov> Hi folks, I opened my first PR ( <https://github.com/ceph/ceph/pull/60286> ) and some tests are failing, e.g. `make check`, with a weird message:
```ninja: error: remove(../src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: error: remove(/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: build stopped: interrupted by user.
ninja: error: remove(../src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
ninja: error: remove(/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules): Directory not empty
Build step 'Execute shell' marked build as failure
[Cobertura] Skipping Cobertura coverage report as build was not SUCCESS or better ...```
I ran `run-make-check.sh` on my local machine and all tests passed. From the documentation it looks like `make check` is essentially `run-make-check.sh`. Any hints?
2024-10-14T12:05:07.291Z
<John> Hi,

I am running a CephFS filesystem on Kubernetes orchestrated using Rook and I am coming up against an issue
that I am struggling to debug.

The CephFS is running on PVCs, currently using the EBS CSI with IO2 block devices that have 6000 IOPS provisioned.

The cluster has come up healthy, and the file system is successfully being mounted, written to, and read
from by workloads. The workloads are K8s Pods related to the GitHub Actions Runner Controller. In our case, the containers
that run in workflows are spun up in separate pods, which Kubernetes schedules to a node in the cluster (sometimes the
workflow pod is scheduled to the same node as the runner pod). To pass files between the runner and the workflow pods,
we are using a CephFS volume mounted with `ReadWriteMany` so that both pods can write to it (this is required).

This is where the issue comes in: in one specific case we are seeing a problem reading a file from the shared volume. The
runner (written in C#) writes the event context to a JSON file, which the checkout action (written in Node.js) then reads
and parses. The parsing fails with this error:

```undefined:1

SyntaxError: Unexpected end of JSON input
    at JSON.parse (<anonymous>)
    at new Context (/__w/_actions/actions/checkout/1d96c772d19495a3b5c517cd2bc0cb401ea0529f/dist/index.js:4787:37)
    ...```
[The C# code that writes that file](https://github.com/actions/runner/blob/9b3b554758c26bf2caa8121278ec94a6f0d97de5/src/Runner.Worker/ExecutionContext.cs#L1176) uses `File.WriteAllText(...)`, which closes the file when it completes and
should throw an exception if closing the file does not succeed.
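For comparison, a Node.js write path that explicitly flushes data to stable storage before closing would look roughly like the
sketch below (illustration only; the real writer is the C# code linked above, and I have no evidence that an explicit fsync
changes the behaviour):

```
// Hypothetical writer: write, fsync, then close, so the file data is flushed
// before the descriptor is released. This is NOT the runner's actual code.
const { openSync, writeSync, fsyncSync, closeSync } = require('fs');

function writeEventFile(path, payload) {
    const fd = openSync(path, 'w');
    try {
        writeSync(fd, JSON.stringify(payload));
        fsyncSync(fd); // flush file data before closing
    } finally {
        closeSync(fd);
    }
}
```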

There are two interesting things here. Firstly, if the reading pod is scheduled to the same node as the writing pod, the
error reading from the shared volume does not occur. Secondly, I forked the checkout action to try to debug what
might be going on, and I tried to log the contents of the JSON file by updating the code to this:

```// Read and check the content of the file
const data = (0, fs_1.readFileSync)(process.env.GITHUB_EVENT_PATH, { encoding: 'utf8' });
console.log(data);

// Read the contents of the file from the volume again and parse it
this.payload = JSON.parse((0, fs_1.readFileSync)(process.env.GITHUB_EVENT_PATH, { encoding: 'utf8' }));```
Reading the file twice and only attempting to parse the second read succeeds, and the error from before is no longer
present. The logged data appears to be correct, but if I attempt to parse the first read rather than just log it, it fails.
This double read is not required on another shared filesystem: originally we had been using AWS EFS, but it was
very slow, nearly 10 times slower for large checkouts.
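For reference, the double-read workaround could be made a little more defensive along these lines (a sketch only, with
hypothetical names; it retries until the JSON parses instead of relying on exactly two reads):

```
// Hypothetical retry wrapper around the workaround in the forked checkout action.
const { readFileSync } = require('fs');

function readJsonWithRetry(path, attempts = 5) {
    let lastErr;
    for (let i = 0; i < attempts; i++) {
        const data = readFileSync(path, { encoding: 'utf8' });
        try {
            return JSON.parse(data);
        } catch (err) {
            lastErr = err;
            // a real version would probably also sleep between attempts
            console.log(`attempt ${i + 1}: read ${data.length} chars, parse failed`);
        }
    }
    throw lastErr;
}

// usage, mirroring the forked action:
// this.payload = readJsonWithRetry(process.env.GITHUB_EVENT_PATH);
```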

We have tried:
• We were originally using GP3 block devices in the storage class for the PVCs; we changed to IO2 to test higher IOPS, which didn't help.
• I mounted the volumes with various mount options (`read_from_replica=localize`, `noasyncreaddir`, `sync`); none of them helped.
• I tried running `sync /path/to/file` in a loop from a sidecar container in the writer's pod to try to diagnose the issue; that didn't help either.
I am not sure why other files written to the shared volume can be read fine while this particular one consistently causes
issues. We are not the first to encounter this issue using the Actions Runner Controller with Rook and CephFS; someone else
ran into it and, as far as I can tell, did not resolve it either: <https://github.com/actions/runner-container-hooks/issues/145>.
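One further diagnostic that could go into the forked checkout action (a sketch with hypothetical names, assuming the failed
parse corresponds to a short or empty read) is to compare the size reported by `stat` with the number of bytes actually read:

```
// Hypothetical check for short/empty reads of the event file.
const { readFileSync, statSync } = require('fs');

function checkEventFile(path) {
    const size = statSync(path).size;                      // size the client thinks the file has
    const data = readFileSync(path, { encoding: 'utf8' }); // what we actually got back
    const bytesRead = Buffer.byteLength(data, 'utf8');
    console.log(`stat size=${size} bytes, read=${bytesRead} bytes`);
    if (bytesRead === 0 || bytesRead < size) {
        console.log('short or empty read: file size and file data appear out of sync');
    }
}
```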

Any advice to help debug this issue would be greatly appreciated!
2024-10-14T12:05:52.263Z
<John> I would also note here that we have already had a chat about this in the Rook Slack, where some helpful points were raised (including some of the tests above), but ultimately no fix has been found. They recommended posting to the Ceph Slack to see if there was any other information, ideas or suggestions that could help in debugging this issue.

Related rook discussion: <https://rook-io.slack.com/archives/CG3HUV94J/p1728460466450799>
2024-10-14T14:51:12.964Z
<gregsfortytwo> do you see what the difference is between the first and second read?
We’ve had a similar report, but I can’t remember if it’s been resolved or if it’s just another one I’ve been confusing with this.
