2024-07-08T12:47:33.913Z | <Ronen Friedman> Will the real Jon Bailey please stand up?
All the jobs in my run failed with:
["The 'file' lookup had an issue accessing the file '~/.cache/src/keys/ssh/jonbailey1993.pub'. file not found, use -vvvvv to see paths searched"] |
2024-07-08T13:11:50.030Z | <Adam Kraitman> I would try to run it again |
2024-07-08T13:12:16.761Z | <Ronen Friedman> Just did; Failed again, with the same message |
2024-07-08T13:16:29.453Z | <Adam Kraitman> Maybe we should discuss it in the next infra meeting; you can add that topic here <https://pad.ceph.com/p/ceph-infra-weekly> |
2024-07-08T13:29:03.011Z | <Adam Kraitman> Can you try running a different suite? Something looks a bit strange with that suite |
2024-07-08T13:29:43.749Z | <Ronen Friedman> Sure. Initiating |
2024-07-08T13:31:25.450Z | <Ronen Friedman> <https://pulpito.ceph.com/rfriedma-2024-07-08_13:30:27-rados-wip-rf-targets-j13-distro-default-smithi/> |
2024-07-08T13:40:03.311Z | <Kyrylo Shatskyy> @Adam Kraitman I can add the topic |
2024-07-08T13:43:41.422Z | <Ronen Friedman> Looks better. I think we're past the failure point in the previous runs.
If so - what is the problem with rados:thrash?
Thanks |
2024-07-08T13:57:09.759Z | <Guillaume Abrioux> any build in shaman fails |
2024-07-08T13:57:12.478Z | <Guillaume Abrioux> ```
RPM build errors:
Signature not supported. Hash algorithm SHA1 not available.
Signature not supported. Hash algorithm SHA1 not available.
Bad exit status from /var/tmp/rpm-tmp.UXptz2 (%build)
Mon Jul 8 01:55:31 PM UTC 2024 :: rm -fr /tmp/install-deps.1010268``` |
2024-07-08T13:57:25.999Z | <Guillaume Abrioux> <https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/81020//consoleFull> |
2024-07-08T13:58:40.670Z | <Guillaume Abrioux> at least for my branch, not sure if I did something wrong? |
2024-07-08T14:11:00.817Z | <John Mulligan> Anyone else having issues logging into sepia machines? I was planning on doing an interactive rerun of a teuthology test I ran yesterday. However, today I can't ssh in to the teuthology VM. I appear to be on the VPN. I am on a different (office) network than I was yesterday. I also tried logging into a folio "dev playground" system and it seems to fail in the same way. |
2024-07-08T14:12:37.712Z | <Ronen Friedman> I am able to ssh (Teuthology, O10, playground) |
2024-07-08T14:13:51.087Z | <John Mulligan> Thanks. I was hoping it was not related to this office network. But now I'm thinking that's the main difference. I don't want to go home just to rerun this test 😕 |
2024-07-08T14:15:31.785Z | <Ronen Friedman> I was never able to connect to Sepia from an IBM office (only from RedHat). Per Mark Kogan - you should define 'split DNS', so that DNS queries for Sepia hosts won't go to the IBM server. |
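For reference, the split-DNS setup Ronen mentions can be sketched on a Linux box running systemd-resolved roughly as below; the interface name `tun0` and the resolver address are placeholders, not the actual sepia VPN values:
```
# Send only sepia lookups to the VPN resolver; everything else stays on
# the office (IBM) DNS. Run after the VPN link is up.
# "tun0" and 172.21.0.1 are hypothetical placeholders.
resolvectl dns tun0 172.21.0.1
resolvectl domain tun0 '~sepia.ceph.com' '~front.sepia.ceph.com'
# Sanity check: which server answers for a sepia host?
resolvectl query teuthology.front.sepia.ceph.com
```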
2024-07-08T14:16:23Z | <John Mulligan> thanks for the hint. I'll look into that |
2024-07-08T14:21:34.439Z | <John Mulligan> I decided to try a different lazier way. switched to usb tethering to my phone. rejoined sepia vpn. ssh works now. **sigh lol** |
2024-07-08T14:22:40.883Z | <Ronen Friedman> Yes. I too found that easier... |
2024-07-08T18:25:07.663Z | <yuriw> Do we have any DNS resolution problems now?
I can't ping `github.com` from `teuthology` or `vossi02` |
2024-07-08T18:57:25.239Z | <Dan Mick> something is happening on one of the builders that I don't yet understand, so, maybe connected |
2024-07-08T19:00:44.292Z | <Dan Mick> There may be an external network routing issue at the moment; some other community-cage members are reporting issues |
2024-07-08T19:01:06.264Z | <Dan Mick> yeah, it's not DNS specifically; I can't ping 8.8.8.8 |
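The triage step Dan applies here, separating routing failures from DNS failures, is essentially the following (hosts are just examples):
```
ping -c 2 8.8.8.8            # raw IP; failure means routing, not DNS
getent hosts github.com      # name resolution only; failure means DNS
```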
2024-07-08T19:03:47.200Z | <Dan Mick> ...and now it seems back. |
2024-07-08T19:04:08.585Z | <Dan Mick> @yuriw try again and lmk |
2024-07-08T19:05:20.803Z | <Dan Mick> the key is usually "use this connection only for this network" or whatever it shows up as in your VPN setup thing |
2024-07-08T19:05:46.970Z | <yuriw> I am good, thx @Dan Mick |
2024-07-08T19:06:54.386Z | <John Mulligan> cool, thanks. I got into the teuthology VM and into folio02. But none of my interactive reruns were working. I'm probably being a blockhead and don't remember how to do it right, but none of the smithi machines worked for ssh (from folio02 -> smithiX that is) |
2024-07-08T19:07:36.597Z | <John Mulligan> I'll pester Adam King about it tomorrow 🙂 |
2024-07-08T19:19:28.426Z | <Dan Mick> do you have agent-forwarding on in your ssh client? Your ssh creds are local to your originating box, and need to be forwarded for multiple hops to work |
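A minimal sketch of the agent-forwarding setup Dan describes, using hosts from this thread; the config stanza is an assumption about a typical layout, not a sepia-specific recommendation:
```
# One-off: forward the local ssh-agent through the first hop so that
# folio02 -> smithiNNN can reuse the key held on your workstation:
ssh -A folio02.front.sepia.ceph.com

# Or persistently, in ~/.ssh/config on the originating box:
#   Host *.front.sepia.ceph.com
#       ForwardAgent yes

# Verify on folio02 that the agent made the hop:
ssh-add -l    # should list the key from your local machine
```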
2024-07-08T19:19:39.333Z | <Dan Mick> <https://github.com/ceph/ceph-build/pull/2265> when you get a chance |
2024-07-08T19:22:09.607Z | <John Mulligan> I will double check but I ran it with `ssh -vv` and it was not making a connection to the host ip yet. (No route to host errors) |
2024-07-08T19:23:18.960Z | <Dan Mick> folio02 routes to smithi001 ok now |
2024-07-08T19:50:13.682Z | <nehaojha> @Dan Mick this ceph.io PR build failure <https://jenkins.ceph.com/job/ceph-website-prs/951/console> looks like a network issue
```stderr: fatal: unable to access 'https://github.com/ceph/ceph.io/': Could not resolve host: github.com```
|
2024-07-08T19:50:23.185Z | <nehaojha> also <https://jenkins.ceph.com/job/ceph-pull-requests/138457/console> on a ceph PR |
2024-07-08T20:03:29.170Z | <Dan Mick> yes, as noted elsewhere the external net had a failure period today |
2024-07-08T20:26:06.264Z | <John Mulligan> Tried again right before driving home. Same error:
`ssh -vv smithi154.front.sepia.ceph.com` ...
```debug2: match found
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug2: resolving "smithi154.front.sepia.ceph.com" port 22
debug1: Connecting to smithi154.front.sepia.ceph.com [172.21.15.154] port 22.
debug1: connect to address 172.21.15.154 port 22: No route to host
ssh: connect to host smithi154.front.sepia.ceph.com port 22: No route to host``` |
2024-07-08T20:26:38.883Z | <John Mulligan> FWIW:
```(virtualenv) [phlogistonjohn@folio02 ~]$ teuthology-lock --list
/home/phlogistonjohn/teuthology/teuthology/lock/cli.py:136: SyntaxWarning: invalid escape sequence '\w'
mo = re.match('\w+@(\w+?)\..*', s['name'])
[
{
"name": "[smithi154.front.sepia.ceph.com](http://smithi154.front.sepia.ceph.com)",
"description": null,
"up": true,
"machine_type": "smithi",
"is_vm": false,
"vm_host": null,
"os_type": "centos",
"os_version": "9.stream",
"arch": "x86_64",
"locked": true,
"locked_since": "2024-07-08 19:28:58.032937",
"locked_by": "phlogistonjohn@folio02",
"mac_address": null,
"ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPgcqUXfQ2+WhIkbWKdGF+KxzG9XQPRTup1M1La6ey6+TmPQcLasIs+agoRkBvj9ViPltjaXe12lRJ4ZDhuyexM="
},
{
"name": "[smithi038.front.sepia.ceph.com](http://smithi038.front.sepia.ceph.com)",
"description": null,
"up": true,
"machine_type": "smithi",
"is_vm": false,
"vm_host": null,
"os_type": "centos",
"os_version": "9.stream",
"arch": "x86_64",
"locked": true,
"locked_since": "2024-07-08 19:28:58.034304",
"locked_by": "phlogistonjohn@folio02",
"mac_address": null,
"ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBLKiM9GEL0R07TyQKIIhg/p8xDYq9vYQEz2+F+11AD3YklWZToCnTxZA8EFMJMQIHlB5ouIN3p13CZeQQiUH/78="
}
]``` |
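As an aside, the SyntaxWarning at the top of that paste points at a real (if harmless) bug in teuthology/lock/cli.py: Python 3.12 warns about unescaped backslashes in ordinary string literals. The usual fix is a raw string, sketched here:
```
# teuthology/lock/cli.py:136 -- a raw string silences the
# "invalid escape sequence '\w'" warning without changing behavior:
mo = re.match(r'\w+@(\w+?)\..*', s['name'])
```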
2024-07-08T20:28:20.415Z | <Dan Mick> are you talking about the warning? |
2024-07-08T20:30:43.643Z | <John Mulligan> No, I have been attempting to rerun some failed tests from Sunday. But every time I run teuthology (not teuthology-suite), it locks hosts and then fails to ssh into them. See the ssh -vv debug output in the previous paste. It hasn't worked once for me today. I am possibly doing something wrong, but according to my notes from last time, I'm not. |
2024-07-08T20:31:15.960Z | <John Mulligan> I pasted the teuthology-lock results in case you wanted to see what hosts I was trying with just now. |
2024-07-08T20:38:45.846Z | <Dan Mick> ok I'm reproducing the ssh/no route for smithi154 |
2024-07-08T20:39:14.920Z | <John Mulligan> ok, glad it's not just me! |
2024-07-08T20:39:39.634Z | <Dan Mick> it's powered off |
2024-07-08T20:40:16.904Z | <John Mulligan> To quote Mr Urkel, "did I do that?" |
2024-07-08T20:40:28.334Z | <Dan Mick> so is 038 |
2024-07-08T20:40:30.259Z | <Dan Mick> I don't know |
2024-07-08T20:40:55.674Z | <Dan Mick> when were the nodes locked last? |
2024-07-08T20:41:48.157Z | <John Mulligan> ok. I just followed the process I used last time according to my notes, but this time from folio02 (per the request that we do more stuff using the dev playgrounds) |
2024-07-08T20:42:15.193Z | <John Mulligan> not sure what you mean by "locked last"? |
2024-07-08T20:42:48.522Z | <Dan Mick> in order to use a teuthology node it must be locked. This can be explicit per-host or done as part of a job submission. |
2024-07-08T20:43:22.350Z | <Dan Mick> teuthology-lock shows the lock status of those hosts. I'm suggesting that something went wrong in the locking process that left them powered down |
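For the explicit per-host path Dan mentions, the commands look roughly like this; the count and hostname are examples, and exact flags may vary by teuthology version:
```
teuthology-lock --lock-many 2 --machine-type smithi   # lock any two free smithis
teuthology-lock --lock smithi154                      # or lock one specific host
teuthology-lock --unlock smithi154                    # release it when finished
```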
2024-07-08T20:43:24.440Z | <John Mulligan> OK, I ran `teuthology -v --lock --block --interactive-on-error reruns/2024-07-08_1.yaml` |
2024-07-08T20:43:45.889Z | <Dan Mick> did that apparently succeed? |
2024-07-08T20:44:13.249Z | <Dan Mick> er, did the locking part of that apparently succeed, I guess I should say |
2024-07-08T20:45:05.025Z | <John Mulligan> it failed. There was a traceback in a teuthology method and it dropped me to a prompt, and I then killed the process to (attempt to) debug it |
2024-07-08T20:45:20.897Z | <John Mulligan> I can't find the exact error at the moment |
2024-07-08T20:46:21.002Z | <Dan Mick> ok. hard to say; might have been some stupidity resulting from the network randomness |
2024-07-08T20:46:52.298Z | <John Mulligan> that's what I was wondering when I heard that earlier, but I don't know how to get the nodes out of this state. |
2024-07-08T20:47:01.719Z | <John Mulligan> they seem "stuck this way" for me |
2024-07-08T20:54:07.397Z | <Dan Mick> the dumbest way would be to unlock and then lock two others |
2024-07-08T20:54:23.206Z | <Dan Mick> but if there's a chance they're in the right state you can power them on and verify that |
2024-07-08T20:54:26.003Z | <Dan Mick> or I can |
2024-07-08T20:55:11.534Z | <Dan Mick> I have done so |
2024-07-08T20:55:33.937Z | <Dan Mick> let 'em get to sshability and then see if your teuth run will run |
2024-07-08T20:55:55.802Z | <John Mulligan> ok.... |
2024-07-08T20:56:13.578Z | <John Mulligan> do I have access to do what you did? If so, how? |
2024-07-08T20:56:41.343Z | <Dan Mick> (and let me just say again for the record that "no route to host" is the **stupidest** error message for "can't establish connection" ever...it has nothing to do with routing at all. I'm sure the ether will appreciate hearing my cries yet again) |
2024-07-08T20:57:02.428Z | <Dan Mick> I don't remember. It's ipmi |
2024-07-08T20:57:27.688Z | <John Mulligan> ah |
2024-07-08T20:57:40.174Z | <Dan Mick> <http://wiki.front.sepia.ceph.com/doku.php?id=testnodeaccess&s[]=ipmi#ipmi> |
2024-07-08T20:57:48.813Z | <Dan Mick> I don't recall if we make that password generally available. probably not. |
2024-07-08T20:57:55.227Z | <Dan Mick> it's easy to make a mistake and cause havoc |
2024-07-08T20:58:37.082Z | <Dan Mick> ....although it's in teuthology.yaml so I guess so |
2024-07-08T20:58:42.479Z | <John Mulligan> oho |
2024-07-08T20:59:21.805Z | <John Mulligan> yeah, I have seen ipmi commands fly by in the logs. I didn't think to dig up an old log file and see if I could reuse the commands in there 😐 |
2024-07-08T20:59:28.230Z | <Dan Mick> both up |
2024-07-08T21:00:35.293Z | <John Mulligan> I haven't tried running a test with nodes pre-locked before. is teuthology smart enough to "see" that I have these nodes locked already? |
2024-07-08T21:02:05.562Z | <Dan Mick> It is, if you supply a targets yaml clause, which can go either in your job or in a separate file, and which you can generate with `teuthology-lock --list-targets` |
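A sketch of what that looks like; the `ubuntu@` user below is an assumption, and the host keys are abbreviated:
```
# Generate a targets clause for the currently locked nodes:
teuthology-lock --list-targets > targets.yaml
# targets.yaml then contains something like:
#   targets:
#     ubuntu@smithi154.front.sepia.ceph.com: ecdsa-sha2-nistp256 AAAA...
#     ubuntu@smithi038.front.sepia.ceph.com: ecdsa-sha2-nistp256 AAAA...
# and can be passed alongside the job yaml:
teuthology -v --interactive-on-error reruns/2024-07-08_1.yaml targets.yaml
```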
2024-07-08T21:02:06.667Z | <John Mulligan> (Manual ssh test worked) |
2024-07-08T21:02:39.744Z | <John Mulligan> perfect. |
2024-07-08T21:02:55.787Z | <Dan Mick> (and don't forget to unlock them when you're done) |
2024-07-08T21:08:13.441Z | <John Mulligan> Ah! It's running now! After messing the indent up like three times (my YAML Engineering degree needs to be revoked) |
2024-07-08T21:08:39.288Z | <John Mulligan> Thanks very much for the assistance, Dan! |
2024-07-08T21:08:45.013Z | <Dan Mick> np |
2024-07-08T22:40:02.706Z | <John Mulligan> I think something is still up. This is what I now get:
```2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-437.el9.x86_64 is already installed.
2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-472.el9.x86_64 is already installed.
2024-07-08 21:55:27,272.272 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-07-08 21:55:27,272.272 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/phlogistonjohn/teuthology/teuthology/run_tasks.py", line 109, in run_tasks
manager.__enter__()
File "/usr/lib64/python3.12/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1236, in task
with parallel() as p:
File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1270, in process_role
version = need_to_install_distro(role_remote, role_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 761, in need_to_install_distro
install_stdout = remote.sh(
^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 97, in sh
proc = self.run(**kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 523, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 174, in _raise_for_status
raise ConnectionLostError(command=self.command,
teuthology.exceptions.ConnectionLostError: SSH connection to smithi116 was lost: 'sudo yum install -y kernel'
2024-07-08 21:55:27,275.275 WARNING:teuthology.run_tasks:Saw failure during task execution, going into interactive mode...
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...``` |
2024-07-08T22:41:51.194Z | <John Mulligan> But I'm done for the day. I'll be closing my laptop. Maybe things will work better tomorrow |