2024-07-19T07:33:29.718Z | <Ilya Dryomov> @Zack Cerza Re: recent uptick of test jobs that get stuck and cleaned up only after 12h timeout
Here is an example of a job where teuthology and/or the infrastructure seem to be involved (as opposed to the test itself): <https://qa-proxy.ceph.com/teuthology/dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926/teuthology.log>
```2024-07-18T23:00:46.552 INFO:tasks.workunit:Created dir /home/ubuntu/cephtest/mnt.0
2024-07-18T23:00:46.552 DEBUG:teuthology.orchestra.run.smithi027:> cd -- /home/ubuntu/cephtest/mnt.0 && mkdir -- client.0
2024-07-18T23:00:46.599 INFO:tasks.workunit:timeout=3h
2024-07-18T23:00:46.599 INFO:tasks.workunit:cleanup=True
2024-07-18T23:00:46.600 DEBUG:teuthology.orchestra.run.smithi027:> rm -rf /home/ubuntu/cephtest/clone.client.0 && git clone <https://github.com//ceph/ceph-ci> /home/ubuntu/cephtest/clone.client.0 && cd /home/ubuntu/cephtest/clone.client.0 && git checkout 01e1aa052c15e2a077a20de402d3ae763de4a30d
2024-07-18T23:00:46.650 INFO:tasks.workunit.client.0.smithi027.stderr:Cloning into '/home/ubuntu/cephtest/clone.client.0'...```
According to `teuthology.log`, the job hung while trying to clone a repo with workunits (i.e. before actually running any tests)
But that is NOT the case -- logging in to `smithi027`, I see the repo fully cloned, as well as remnants of the workunit being run, but the cluster apparently getting zapped in the middle of it |
2024-07-19T07:37:58.778Z | <Ilya Dryomov> Ceph daemon logs (on the same node -- it's a single node job) end abruptly 17 minutes after the last line in `teuthology.log`: |
2024-07-19T07:38:16.945Z | <Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-mon.a.log
2024-07-18T23:17:44.373+0000 7fde9ac31640 20 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round waiting_for_readable
2024-07-18T23:17:44.373+0000 7fde9ac31640 20 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round waiting_for_writeable
2024-07-18T23:17:44.373+0000 7fde9ac31640 10 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round done w/ waiters, state active``` |
2024-07-19T07:38:25.347Z | <Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-mgr.x.log
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth: id 4 AQBeoJlmCMzXARAAJE1B3pmF+wbT5f3wA6IVLw== expires 2024-07-18T23:16:14.030918+0000
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth: id 5 AQBJoZlmnR/BAxAAjhf4C2bXSeC2mDKcVdgKgA== expires 2024-07-18T23:20:14.030918+0000
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth: id 6 AQA+oplmCuJWBRAAht+Muc5YZk2hAjXCcGKQqg== expires 2024-07-18T23:24:14.089579+0000``` |
2024-07-19T07:38:36.659Z | <Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-osd.0.log
2024-07-18T23:17:44.377+0000 7fd9d9a38640 20 osd.0 79 get_nextmap_reserved map_reservations: {79=1}
2024-07-18T23:17:44.377+0000 7fd9d9a38640 10 osd.0 79 maybe_share_map: con v2:172.21.15.27:6804/3442971125 our osdmap epoch of 79 is not newer than session's projected_epoch of 79
2024-07-18T23:17:44.377+0000 7fd9d9a38640 1 -- [v2:172.21.15.27:6805/938045435,v1:172.21.15.27:6807/938045435] --> [v2:172.21.15.27:6804/3442971125,v1:172.21.15.27:6806/3442971125] -- MOSDECSubOpWrite(3.3s2 79/19 ECSubWrite(tid=380184, reqid=client.4394.0:547744, at_version=79'65994, trim_to=73'55963, roll_forward_to=79'65993)) -- 0x55e8b0480000 con 0x55e8afb26000``` |
2024-07-19T07:42:08.986Z | <Ilya Dryomov> These jobs also don't seem to be killable: |
2024-07-19T07:42:11.309Z | <Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926
Traceback (most recent call last):
File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
sys.exit(main())
File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
teuthology.kill.main(args)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
kill_job(run_name, job_id, archive_base, owner)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 87, in kill_job
raise RuntimeError(
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.``` |
2024-07-19T07:42:27.635Z | <Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926 --owner scheduled_dis@teuthology
2024-07-19 07:41:11,409.409 INFO:teuthology.kill:No teuthology processes running
Traceback (most recent call last):
File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
sys.exit(main())
File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
teuthology.kill.main(args)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
kill_job(run_name, job_id, archive_base, owner)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 93, in kill_job
report.try_push_job_info(job_info, dict(status="dead"))
File "/cephfs/home/dis/src/teuthology/teuthology/report.py", line 501, in try_push_job_info
run_name = job_config['name']
KeyError: 'name'``` |
2024-07-19T07:54:11.281Z | <Ilya Dryomov> ... despite being reported as `Running` and hogging test nodes, of course |
2024-07-19T16:21:12.991Z | <Zack Cerza> when I look at that job's teuthology.log, I don't see it hang where you claim. I see the `git clone` take ~46s, then the workunit starts, and the hang appears ~12s after that.
I'm concerned about the errors during the kill attempts. those commands were run on teuthology.front, correct? I ask because that's how it might look if run on the wrong host |
2024-07-19T17:14:27.202Z | <Ilya Dryomov> > when I look at that job's teuthology.log, I don't see it hang where you claim.
It was definitely hung at the point that I pasted -- I checked multiple times over the course of a couple of hours (this was at the 8-9 hour mark of the job)
Something must have flushed when the job reached the 12 hour mark, likely as part of it getting killed |
2024-07-19T17:15:12.483Z | <Ilya Dryomov> > those commands were run on teuthology.front, correct? I ask because that's how it might look if run on the wrong host
Yes, they were run on teuthology.front |
2024-07-19T19:09:25.474Z | <Ilya Dryomov> It looks like the issue with attempting to kill is unrelated to the weird hang
I have just tried to kill a job in a different run (one that is actually running and where logging to `teuthology.log` is working fine) and `teuthology-kill` failed the same way for me |
2024-07-19T19:09:56.477Z | <Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi/7808337
Traceback (most recent call last):
File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
sys.exit(main())
File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
teuthology.kill.main(args)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
kill_job(run_name, job_id, archive_base, owner)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 87, in kill_job
raise RuntimeError(
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.``` |
2024-07-19T19:10:26.528Z | <Ilya Dryomov> It occurs to me that I haven't tried to kill a single job (as opposed to the entire run) in a long time |
2024-07-19T19:11:58.339Z | <Ilya Dryomov> ... so it's possible that this piece has been broken for a while |
2024-07-19T19:12:52.926Z | <Ilya Dryomov> Passing the job ID as a separate argument (`teuthology-kill -r dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi -j 7808337`) doesn't help |
2024-07-19T19:14:58.306Z | <Ilya Dryomov> ... nor does passing the owner:
```$ teuthology-kill -r dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi -j 7808337 --owner scheduled_dis@teuthology
2024-07-19 19:13:35,228.228 INFO:teuthology.kill:No teuthology processes running
Traceback (most recent call last):
File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
sys.exit(main())
File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
teuthology.kill.main(args)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
kill_job(run_name, job_id, archive_base, owner)
File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 93, in kill_job
report.try_push_job_info(job_info, dict(status="dead"))
File "/cephfs/home/dis/src/teuthology/teuthology/report.py", line 501, in try_push_job_info
run_name = job_config['name']
KeyError: 'name'``` |
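Reading the two tracebacks together, a plausible mechanism is that the job's info never gets loaded: without `--owner` the kill path gives up because it cannot determine the owner, and with `--owner` it gets past that check but the job config it assembled still lacks a `name` key, so the status reporter crashes. A rough sketch of that flow, using hypothetical helper names rather than the actual teuthology code:
```python
# Hypothetical reconstruction of the failure mode in the two tracebacks above;
# helper names are illustrative, not the actual teuthology code.
def load_job_info_from_archive(run_name, job_id):
    # Stand-in for the archive lookup -- it evidently comes back empty here.
    return {}

def try_push_job_info(job_config, extra_info):
    run_name = job_config['name']  # KeyError if the lookup never filled in 'name'
    print(f"marking {run_name}/{job_config.get('job_id')} as {extra_info}")

def kill_job(run_name, job_id, owner=None):
    job_info = load_job_info_from_archive(run_name, job_id)
    if not owner:
        owner = job_info.get('owner')
    if not owner:
        # First traceback: no --owner on the command line, lookup failed.
        raise RuntimeError(
            'I could not figure out the owner of the requested job. '
            'Please pass --owner <owner>.')
    # Second traceback: --owner supplied, but job_info still lacks 'name'.
    try_push_job_info(job_info, dict(status='dead'))

kill_job('dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi',
         '7808337', owner='scheduled_dis@teuthology')  # -> KeyError: 'name'
```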
2024-07-19T19:15:32.453Z | <Zack Cerza> let me take a look at that one |
2024-07-19T19:25:39.127Z | <Zack Cerza> oh. I see what is happening |
2024-07-19T19:26:22.379Z | <Zack Cerza> we're trying to get the run info from the filesystem, as we always have for kill operations. but:
```(virtualenv) zack@teuthology:~/teuthology_main$ ls ~teuthworker/archive
ls: cannot access '/home/teuthworker/archive': Permission denied``` |
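For context, a minimal sketch of what such a filesystem-based lookup could look like, assuming per-job metadata is kept in an `info.yaml` under the archive directory (the paths and names here are assumptions, not the real teuthology implementation). With `/home/teuthworker/archive` unreadable, the open fails and the owner comes back unknown, which matches the RuntimeError above:
```python
# Illustrative sketch only -- not the real teuthology code.
import os
import yaml  # PyYAML

ARCHIVE_BASE = '/home/teuthworker/archive'  # assumed archive location

def find_job_owner(run_name, job_id, archive_base=ARCHIVE_BASE):
    """Try to read the job's owner from its archived metadata."""
    info_path = os.path.join(archive_base, run_name, str(job_id), 'info.yaml')
    try:
        with open(info_path) as f:           # PermissionError (an OSError) if
            info = yaml.safe_load(f) or {}   # the archive dir is unreadable
    except OSError:
        return None
    return info.get('owner')

owner = find_job_owner(
    'dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi', 7808337)
if owner is None:
    raise RuntimeError('I could not figure out the owner of the requested job. '
                       'Please pass --owner <owner>.')
```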
2024-07-19T19:27:07.760Z | <Ilya Dryomov> So this is caused by the transition to `/cephfs` for home directories? |
2024-07-19T19:27:25.208Z | <Zack Cerza> appears so. |
2024-07-19T19:34:37.374Z | <Zack Cerza> ok, I have a fix that I think is good enough for now |
2024-07-19T19:35:06.658Z | <Zack Cerza> this issue only affects killing by job, which I'm assuming is why we hadn't noticed it before |
2024-07-19T19:38:30.386Z | <Zack Cerza> <https://github.com/ceph/teuthology/pull/1980> |