ceph - sepia - 2024-07-19

2024-07-19T07:33:29.718Z
<Ilya Dryomov> @Zack Cerza Re: recent uptick of test jobs that get stuck and cleaned up only after the 12h timeout
Here is an example of a job where teuthology and/or the infrastructure seem to be involved (as opposed to the test itself): <https://qa-proxy.ceph.com/teuthology/dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926/teuthology.log>
```2024-07-18T23:00:46.552 INFO:tasks.workunit:Created dir /home/ubuntu/cephtest/mnt.0
2024-07-18T23:00:46.552 DEBUG:teuthology.orchestra.run.smithi027:> cd -- /home/ubuntu/cephtest/mnt.0 && mkdir -- client.0
2024-07-18T23:00:46.599 INFO:tasks.workunit:timeout=3h
2024-07-18T23:00:46.599 INFO:tasks.workunit:cleanup=True
2024-07-18T23:00:46.600 DEBUG:teuthology.orchestra.run.smithi027:> rm -rf /home/ubuntu/cephtest/clone.client.0 && git clone <https://github.com//ceph/ceph-ci> /home/ubuntu/cephtest/clone.client.0 && cd /home/ubuntu/cephtest/clone.client.0 && git checkout 01e1aa052c15e2a077a20de402d3ae763de4a30d
2024-07-18T23:00:46.650 INFO:tasks.workunit.client.0.smithi027.stderr:Cloning into '/home/ubuntu/cephtest/clone.client.0'...```
According to `teuthology.log`, the job hung while trying to clone a repo with workunits (i.e. before actually running any tests)
But that is NOT the case -- logging in to `smithi027`, I see the repo fully cloned, as well as remnants of the workunit being run, but the cluster apparently got zapped in the middle of it
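A minimal sketch of the on-node check behind "the repo fully cloned" (paths are taken from the quoted log above; the snippet itself is illustrative and not part of any teuthology tooling):
```import os
import subprocess

# Path from the quoted teuthology.log; run this on smithi027 itself.
clone_dir = "/home/ubuntu/cephtest/clone.client.0"

# A finished clone has a .git directory and HEAD at the sha the workunit task checked out.
print("cloned:", os.path.isdir(os.path.join(clone_dir, ".git")))
head = subprocess.run(["git", "-C", clone_dir, "rev-parse", "HEAD"],
                      capture_output=True, text=True, check=True)
print("HEAD:", head.stdout.strip())```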
2024-07-19T07:37:58.778Z
<Ilya Dryomov> Ceph daemon logs (on the same node -- it's a single node job) end abruptly 17 minutes after the last line in `teuthology.log`:
2024-07-19T07:38:16.945Z
<Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-mon.a.log 
2024-07-18T23:17:44.373+0000 7fde9ac31640 20 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round waiting_for_readable
2024-07-18T23:17:44.373+0000 7fde9ac31640 20 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round waiting_for_writeable
2024-07-18T23:17:44.373+0000 7fde9ac31640 10 mon.a@0(leader).paxos(paxos active c 754..1263) finish_round done w/ waiters, state active```
2024-07-19T07:38:25.347Z
<Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-mgr.x.log
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth:  id 4 AQBeoJlmCMzXARAAJE1B3pmF+wbT5f3wA6IVLw== expires 2024-07-18T23:16:14.030918+0000
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth:  id 5 AQBJoZlmnR/BAxAAjhf4C2bXSeC2mDKcVdgKgA== expires 2024-07-18T23:20:14.030918+0000
2024-07-18T23:17:44.349+0000 7f6e58df0640 10 auth:  id 6 AQA+oplmCuJWBRAAht+Muc5YZk2hAjXCcGKQqg== expires 2024-07-18T23:24:14.089579+0000```
2024-07-19T07:38:36.659Z
<Ilya Dryomov> ```ubuntu@smithi027:/var/log/ceph$ tail -3 ceph-osd.0.log 
2024-07-18T23:17:44.377+0000 7fd9d9a38640 20 osd.0 79 get_nextmap_reserved map_reservations: {79=1}
2024-07-18T23:17:44.377+0000 7fd9d9a38640 10 osd.0 79 maybe_share_map: con v2:172.21.15.27:6804/3442971125 our osdmap epoch of 79 is not newer than session's projected_epoch of 79
2024-07-18T23:17:44.377+0000 7fd9d9a38640  1 -- [v2:172.21.15.27:6805/938045435,v1:172.21.15.27:6807/938045435] --> [v2:172.21.15.27:6804/3442971125,v1:172.21.15.27:6806/3442971125] -- MOSDECSubOpWrite(3.3s2 79/19 ECSubWrite(tid=380184, reqid=client.4394.0:547744, at_version=79'65994, trim_to=73'55963, roll_forward_to=79'65993)) -- 0x55e8b0480000 con 0x55e8afb26000```
2024-07-19T07:42:08.986Z
<Ilya Dryomov> These jobs also don't seem to be killable:
2024-07-19T07:42:11.309Z
<Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926
Traceback (most recent call last):
  File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
    sys.exit(main())
  File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
    teuthology.kill.main(args)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
    kill_job(run_name, job_id, archive_base, owner)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 87, in kill_job
    raise RuntimeError(
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.```
2024-07-19T07:42:27.635Z
<Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7807926 --owner scheduled_dis@teuthology
2024-07-19 07:41:11,409.409 INFO:teuthology.kill:No teuthology processes running
Traceback (most recent call last):
  File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
    sys.exit(main())
  File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
    teuthology.kill.main(args)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
    kill_job(run_name, job_id, archive_base, owner)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 93, in kill_job
    report.try_push_job_info(job_info, dict(status="dead"))
  File "/cephfs/home/dis/src/teuthology/teuthology/report.py", line 501, in try_push_job_info
    run_name = job_config['name']
KeyError: 'name'```
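Both tracebacks fit the same underlying pattern: the owner and run name are looked up from the job's on-disk archive, and when that lookup quietly comes back empty, the first attempt stops at the owner check while the `--owner` attempt gets as far as `try_push_job_info` and trips over the missing `name` key. A rough illustration of that failure mode (hypothetical helper and file name, not the actual `teuthology/kill.py` code):
```import os
import yaml  # PyYAML

def load_job_info(archive_base, run_name, job_id):
    """Illustrative sketch: read a job's info.yaml from the archive tree."""
    info_path = os.path.join(archive_base, run_name, str(job_id), "info.yaml")
    try:
        with open(info_path) as f:
            return yaml.safe_load(f) or {}
    except OSError:
        # An unreadable archive (PermissionError, FileNotFoundError, ...) quietly
        # becomes an empty dict instead of a loud failure.
        return {}

job_info = load_job_info(
    "/home/teuthworker/archive",
    "dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi",
    7807926,
)
print(job_info.get("owner"))  # None -> "could not figure out the owner"
print(job_info.get("name"))   # missing -> the report step's KeyError: 'name'```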
2024-07-19T07:54:11.281Z
<Ilya Dryomov> ... despite being reported as `Running` and hogging test nodes, of course
2024-07-19T16:21:12.991Z
<Zack Cerza> when I look at that job's teuthology.log, I don't see it hang where you claim. I see the `git clone` take ~46s, then the workunit starts, and the hang appears ~12s after that.
I'm concerned about the errors during the kill attempts. those commands were run on teuthology.front, correct? I ask because that's how it might look if run on the wrong host
2024-07-19T17:14:27.202Z
<Ilya Dryomov> > when I look at that job's teuthology.log, I don't see it hang where you claim.
It was definitely hung at the point that I pasted -- I checked multiple times over the course of a couple of hours (this was at the 8-9 hour mark of the job)
Something must have flushed when the job reached the 12 hour mark, likely as part of it getting killed
2024-07-19T17:15:12.483Z
<Ilya Dryomov> > those commands were run on teuthology.front, correct? I ask because that's how it might look if run on the wrong host
Yes, they were run on teuthology.front
2024-07-19T19:09:25.474Z
<Ilya Dryomov> It looks like the issue with attempting to kill is unrelated to the weird hang
I have just tried to kill a job in a different run (one that is actually running and where logging to `teuthology.log` is working fine) and `teuthology-kill` failed the same way for me
2024-07-19T19:09:56.477Z
<Ilya Dryomov> ```$ teuthology-kill -J dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi/7808337
Traceback (most recent call last):
  File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
    sys.exit(main())
  File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
    teuthology.kill.main(args)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
    kill_job(run_name, job_id, archive_base, owner)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 87, in kill_job
    raise RuntimeError(
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.```
2024-07-19T19:10:26.528Z
<Ilya Dryomov> It occurs to me that I haven't tried to kill a single job (as opposed to the entire run) in a long time
2024-07-19T19:11:58.339Z
<Ilya Dryomov> ... so it's possible that this piece has been broken for a while
2024-07-19T19:12:52.926Z
<Ilya Dryomov> Passing the job ID as a separate argument (`teuthology-kill -r dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi -j 7808337`) doesn't help
2024-07-19T19:14:58.306Z
<Ilya Dryomov> ... nor does passing the owner:
```$ teuthology-kill -r dis-2024-07-19_12:06:54-rbd-wip-dis-testing-distro-default-smithi -j 7808337 --owner scheduled_dis@teuthology
2024-07-19 19:13:35,228.228 INFO:teuthology.kill:No teuthology processes running
Traceback (most recent call last):
  File "/cephfs/home/dis/src/teuthology/virtualenv/bin/teuthology-kill", line 8, in <module>
    sys.exit(main())
  File "/cephfs/home/dis/src/teuthology/scripts/kill.py", line 44, in main
    teuthology.kill.main(args)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 38, in main
    kill_job(run_name, job_id, archive_base, owner)
  File "/cephfs/home/dis/src/teuthology/teuthology/kill.py", line 93, in kill_job
    report.try_push_job_info(job_info, dict(status="dead"))
  File "/cephfs/home/dis/src/teuthology/teuthology/report.py", line 501, in try_push_job_info
    run_name = job_config['name']
KeyError: 'name'```
2024-07-19T19:15:32.453Z
<Zack Cerza> let me take a look at that one
2024-07-19T19:25:39.127Z
<Zack Cerza> oh. I see what is happening
2024-07-19T19:26:22.379Z
<Zack Cerza> we're trying to get the run info from the filesystem, as we always have for kill operations. but:
```(virtualenv) zack@teuthology:~/teuthology_main$ ls ~teuthworker/archive
ls: cannot access '/home/teuthworker/archive': Permission denied```
2024-07-19T19:27:07.760Z
<Ilya Dryomov> So this is caused by the transition to `/cephfs` for home directories?
2024-07-19T19:27:25.208Z
<Zack Cerza> appears so.
2024-07-19T19:34:37.374Z
<Zack Cerza> ok, I have a fix that I think is good enough for now
2024-07-19T19:35:06.658Z
<Zack Cerza> this issue only affects killing by job, which I'm assuming is why we hadn't noticed it before
2024-07-19T19:38:30.386Z
<Zack Cerza> <https://github.com/ceph/teuthology/pull/1980>
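Not having looked at the PR, one plausible shape for that kind of fix, purely a sketch with an assumed paddles base URL and endpoint layout, is to fall back to the results server whenever the archive tree isn't readable:
```import os

import requests
import yaml

# Assumed base URL for the paddles results server; in practice this would come
# from the teuthology config rather than being hard-coded.
PADDLES_URL = "http://paddles.front.sepia.ceph.com"

def get_job_info(archive_base, run_name, job_id):
    """Sketch: prefer the on-disk archive, fall back to the results API."""
    info_path = os.path.join(archive_base, run_name, str(job_id), "info.yaml")
    if os.access(info_path, os.R_OK):
        with open(info_path) as f:
            return yaml.safe_load(f) or {}
    # Archive unreadable (e.g. /home/teuthworker/archive: Permission denied),
    # so ask the results server instead. The endpoint path is an assumption.
    resp = requests.get(f"{PADDLES_URL}/runs/{run_name}/jobs/{job_id}/", timeout=30)
    resp.raise_for_status()
    return resp.json()```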
