ceph - sepia - 2024-07-08

Timestamp (UTC) / Message
2024-07-08T12:47:33.913Z
<Ronen Friedman> Will the real Jon Bailey please stand up?
All the jobs in my run failed with:
["The 'file' lookup had an issue accessing the file '~/.cache/src/keys/ssh/jonbailey1993.pub'. file not found, use -vvvvv to see paths searched"]
2024-07-08T13:11:50.030Z
<Adam Kraitman> I would try to run it again
2024-07-08T13:12:16.761Z
<Ronen Friedman> Just did; Failed again, with the same message
2024-07-08T13:16:29.453Z
<Adam Kraitman> Maybe we should discuss it in the next infra meeting; you can add that topic here: <https://pad.ceph.com/p/ceph-infra-weekly>
2024-07-08T13:29:03.011Z
<Adam Kraitman> Can you try running a different suite? Something looks a bit strange with that suite
2024-07-08T13:29:43.749Z
<Ronen Friedman> Sure. Initiating
2024-07-08T13:31:25.450Z
<Ronen Friedman> <https://pulpito.ceph.com/rfriedma-2024-07-08_13:30:27-rados-wip-rf-targets-j13-distro-default-smithi/>
2024-07-08T13:40:03.311Z
<Kyrylo Shatskyy> @Adam Kraitman I can add the topic
2024-07-08T13:43:41.422Z
<Ronen Friedman> Looks better. I think we're past the failure point in the previous runs.
If so - what is the problem with rados:thrash?
Thanks
2024-07-08T13:57:09.759Z
<Guillaume Abrioux> any build in shaman fails
2024-07-08T13:57:12.478Z
<Guillaume Abrioux> ```
RPM build errors:
    Signature not supported. Hash algorithm SHA1 not available.
    Signature not supported. Hash algorithm SHA1 not available.
    Bad exit status from /var/tmp/rpm-tmp.UXptz2 (%build)
Mon Jul  8 01:55:31 PM UTC 2024 :: rm -fr /tmp/install-deps.1010268```
2024-07-08T13:57:25.999Z
<Guillaume Abrioux> <https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/81020//consoleFull>
2024-07-08T13:58:40.670Z
<Guillaume Abrioux> at least for my branch, not sure if I did something wrong?
2024-07-08T14:11:00.817Z
<John Mulligan> Anyone else having issues logging into sepia machines? I was planning on doing an interactive rerun of a teuthology test I ran yesterday. However, today I can't ssh in to the teuthology vm. I appear to be on the vpn. I am on a different (office) network than I was yesterday. I also tried logging into a folio "dev playground" system and it seems to fail in the same way.
2024-07-08T14:12:37.712Z
<Ronen Friedman> I am able to ssh (Teuthology, O10, playground)
2024-07-08T14:13:51.087Z
<John Mulligan> Thanks. I was hoping it was not related to this office network. But now I'm thinking that's the main difference. I don't want to go home just to rerun this test 😕
2024-07-08T14:15:31.785Z
<Ronen Friedman> I was never able to connect to Sepia from an IBM office (only from RedHat). Per Mark Kogan - you should define 'split DNS', so that DNS queries for Sepia hosts won't go to the IBM server.
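For reference, a minimal sketch of that split-DNS setup, assuming systemd-resolved and a VPN link named tun0; the resolver address below is a placeholder, not the actual Sepia DNS server:
```
# Sketch only: send lookups for Sepia domains over the VPN link, everything else to the default resolver.
# Assumes systemd-resolved; replace tun0 and 172.21.0.1 (placeholder) with your VPN interface and the Sepia DNS server.
resolvectl dns tun0 172.21.0.1
resolvectl domain tun0 '~front.sepia.ceph.com' '~sepia.ceph.com'
# Verify that Sepia names now resolve via the VPN:
resolvectl query teuthology.front.sepia.ceph.com
```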
2024-07-08T14:16:23Z
<John Mulligan> thanks for the hint. I'll look into that
2024-07-08T14:21:34.439Z
<John Mulligan> I decided to try a different lazier way. switched to usb tethering to my phone. rejoined sepia vpn. ssh works now. **sigh lol**
2024-07-08T14:22:40.883Z
<Ronen Friedman> Yes. I too found that easier...
2024-07-08T18:25:07.663Z
<yuriw> Do we have any DNS resolution problems now?
I can't ping `github.com` from `teuthology` or `vossi02`
2024-07-08T18:57:25.239Z
<Dan Mick> something is happening on one of the builders that I don't yet understand, so, maybe connected
2024-07-08T19:00:44.292Z
<Dan Mick> There may be an external network routing issue at the moment; some other community-cage members are reporting issues
2024-07-08T19:01:06.264Z
<Dan Mick> yeah, it's not DNS specifically; I can't ping 8.8.8.8
2024-07-08T19:03:47.200Z
<Dan Mick> ...and now it seems back.
2024-07-08T19:04:08.585Z
<Dan Mick> @yuriw try again and lmk
2024-07-08T19:05:20.803Z
<Dan Mick> the key is usually "use this connection only for this network" or whatever it shows up as in your VPN setup thing
2024-07-08T19:05:46.970Z
<yuriw> I am good, thx @Dan Mick
2024-07-08T19:06:54.386Z
<John Mulligan> cool, thanks.  I got into the teuthology vm and into folio02. But none of my interactive reruns were working. I'm probably being a blockhead and don't remember how to do it right, but none of the smithi machines worked for ssh (from folio02 -> smithiX, that is)
2024-07-08T19:07:36.597Z
<John Mulligan> I'll pester Adam King about it tomorrow 🙂
2024-07-08T19:19:28.426Z
<Dan Mick> do you have agent-forwarding on in your ssh client?  Your ssh creds are local to your originating box, and need to be forwarded for multiple hops to work
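As an aside, a minimal example of what enabling agent forwarding for the extra hop can look like; the folio02 hostname and the host pattern are assumptions:
```
# One-off: forward the local ssh agent so the folio02 -> smithiNNN hop can reuse your key
ssh -A phlogistonjohn@folio02.front.sepia.ceph.com

# Or persistently, in ~/.ssh/config on the originating box:
#   Host *.front.sepia.ceph.com
#       ForwardAgent yes
```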
2024-07-08T19:19:39.333Z
<Dan Mick> <https://github.com/ceph/ceph-build/pull/2265> when you get a chance
2024-07-08T19:22:09.607Z
<John Mulligan> I will double check but I ran it with `ssh -vv` and it was not making a connection to the host ip yet. (No route to host errors)
2024-07-08T19:23:18.960Z
<Dan Mick> folio02 routes to smithi001 ok now
2024-07-08T19:50:13.682Z
<nehaojha> @Dan Mick this ceph.io PR build failure <https://jenkins.ceph.com/job/ceph-website-prs/951/console> looks like a network issue
```stderr: fatal: unable to access 'https://github.com/ceph/ceph.io/': Could not resolve host: github.com```
2024-07-08T19:50:23.185Z
<nehaojha> also <https://jenkins.ceph.com/job/ceph-pull-requests/138457/console> on a ceph PR
2024-07-08T20:03:29.170Z
<Dan Mick> yes, as noted elsewhere the external net had a failure period today
2024-07-08T20:26:06.264Z
<John Mulligan> Tried again right before driving home. Same error:
`ssh -vv smithi154.front.sepia.ceph.com` ...
```debug2: match found
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug2: resolving "smithi154.front.sepia.ceph.com" port 22
debug1: Connecting to smithi154.front.sepia.ceph.com [172.21.15.154] port 22.
debug1: connect to address 172.21.15.154 port 22: No route to host
ssh: connect to host smithi154.front.sepia.ceph.com port 22: No route to host```
2024-07-08T20:26:38.883Z
<John Mulligan> FWIW:
```(virtualenv) [phlogistonjohn@folio02 ~]$ teuthology-lock --list
/home/phlogistonjohn/teuthology/teuthology/lock/cli.py:136: SyntaxWarning: invalid escape sequence '\w'
  mo = re.match('\w+@(\w+?)\..*', s['name'])
[
    {
        "name": "[smithi154.front.sepia.ceph.com](http://smithi154.front.sepia.ceph.com)",
        "description": null,
        "up": true,
        "machine_type": "smithi",
        "is_vm": false,
        "vm_host": null,
        "os_type": "centos",
        "os_version": "9.stream",
        "arch": "x86_64",
        "locked": true,
        "locked_since": "2024-07-08 19:28:58.032937",
        "locked_by": "phlogistonjohn@folio02",
        "mac_address": null,
        "ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPgcqUXfQ2+WhIkbWKdGF+KxzG9XQPRTup1M1La6ey6+TmPQcLasIs+agoRkBvj9ViPltjaXe12lRJ4ZDhuyexM="
    },
    {
        "name": "[smithi038.front.sepia.ceph.com](http://smithi038.front.sepia.ceph.com)",
        "description": null,
        "up": true,
        "machine_type": "smithi",
        "is_vm": false,
        "vm_host": null,
        "os_type": "centos",
        "os_version": "9.stream",
        "arch": "x86_64",
        "locked": true,
        "locked_since": "2024-07-08 19:28:58.034304",
        "locked_by": "phlogistonjohn@folio02",
        "mac_address": null,
        "ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBLKiM9GEL0R07TyQKIIhg/p8xDYq9vYQEz2+F+11AD3YklWZToCnTxZA8EFMJMQIHlB5ouIN3p13CZeQQiUH/78="
    }
]```
2024-07-08T20:28:20.415Z
<Dan Mick> are you talking about the warning?
2024-07-08T20:30:43.643Z
<John Mulligan> No, I have attempted to rerun some failed tests from Sunday. But every time I run teuthology (not teuthology-suite), it locks hosts and then fails to ssh into them. See the ssh -vv debug output in the previous paste. It's not worked once for me today. I am possibly doing something wrong, but according to my notes from the last time I did this, I'm not.
2024-07-08T20:31:15.960Z
<John Mulligan> I pasted the teuthology-lock results in case you wanted to see what hosts I was trying with just now.
2024-07-08T20:38:45.846Z
<Dan Mick> ok I'm reproducing the ssh/no route for smithi154
2024-07-08T20:39:14.920Z
<John Mulligan> ok, glad it's not just me!
2024-07-08T20:39:39.634Z
<Dan Mick> it's powered off
2024-07-08T20:40:16.904Z
<John Mulligan> To quote Mr Urkel, "did I do that?"
2024-07-08T20:40:28.334Z
<Dan Mick> so is 038
2024-07-08T20:40:30.259Z
<Dan Mick> I don't know
2024-07-08T20:40:55.674Z
<Dan Mick> when were the nodes locked last?
2024-07-08T20:41:48.157Z
<John Mulligan> ok. I just followed the process I used last time according to my notes, but this time from folio02 (per the request that we do more stuff using the dev playgrounds)
2024-07-08T20:42:15.193Z
<John Mulligan> not sure what you mean by "locked last"?
2024-07-08T20:42:48.522Z
<Dan Mick> in order to use a teuthology node it must be locked.  This can be explicit per-host or done as part of a job submission.
2024-07-08T20:43:22.350Z
<Dan Mick> teuthology-lock shows the lock status of those hosts. I'm suggesting that something went wrong in the locking process that left them powered down
2024-07-08T20:43:24.440Z
<John Mulligan> OK, I ran `teuthology -v --lock --block --interactive-on-error reruns/2024-07-08_1.yaml`
2024-07-08T20:43:45.889Z
<Dan Mick> did that apparently succeed?
2024-07-08T20:44:13.249Z
<Dan Mick> er, did the locking part of that apparently succeed, I guess I should say
2024-07-08T20:45:05.025Z
<John Mulligan> it failed. there was a traceback in a teuthology method and it dropped me to a prompt, and I then killed the process to (attempt to) debug it
2024-07-08T20:45:20.897Z
<John Mulligan> I can't find the exact error at the moment
2024-07-08T20:46:21.002Z
<Dan Mick> ok.   hard to say; might have been some stupidity resulting from the network randomness
2024-07-08T20:46:52.298Z
<John Mulligan> that's what I was wondering when I heard that earlier, but I don't know how to get the nodes out of this state.
2024-07-08T20:47:01.719Z
<John Mulligan> they seem "stuck this way" for me
2024-07-08T20:54:07.397Z
<Dan Mick> the dumbest way would be to unlock and then lock two others
2024-07-08T20:54:23.206Z
<Dan Mick> but if there's a chance they're in the right state you can power them on and verify that
2024-07-08T20:54:26.003Z
<Dan Mick> or I can
2024-07-08T20:55:11.534Z
<Dan Mick> I have done so
2024-07-08T20:55:33.937Z
<Dan Mick> let them get to sshability and then see if your teuth run will run
2024-07-08T20:55:55.802Z
<John Mulligan> ok....
2024-07-08T20:56:13.578Z
<John Mulligan> do I have access to do what you did? If so, how?
2024-07-08T20:56:41.343Z
<Dan Mick> (and let me just say again for the record that "no route to host" is the **stupidest** error message for "can't establish connection" ever...it has nothing to do with routing at all.  I'm sure the ether will appreciate hearing my cries yet again)
2024-07-08T20:57:02.428Z
<Dan Mick> I don't remember.  It's ipmi
2024-07-08T20:57:27.688Z
<John Mulligan> ah
2024-07-08T20:57:40.174Z
<Dan Mick> <http://wiki.front.sepia.ceph.com/doku.php?id=testnodeaccess&s[]=ipmi#ipmi>
2024-07-08T20:57:48.813Z
<Dan Mick> I don't recall if we make that password generally available.  probably not.
2024-07-08T20:57:55.227Z
<Dan Mick> it's easy to make a mistake and cause havoc
2024-07-08T20:58:37.082Z
<Dan Mick> ....although it's in teuthology.yaml so I guess so
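For the record, a rough sketch of the kind of IPMI check and power-on involved; the BMC hostname pattern and the credential variables are assumptions, with the real values living in teuthology.yaml and on the wiki page above:
```
# Sketch only: check and restore power on a test node via its BMC.
# Hostname pattern and $IPMI_USER/$IPMI_PASS are assumptions; take the real values from teuthology.yaml.
ipmitool -I lanplus -H smithi154.ipmi.sepia.ceph.com -U "$IPMI_USER" -P "$IPMI_PASS" power status
ipmitool -I lanplus -H smithi154.ipmi.sepia.ceph.com -U "$IPMI_USER" -P "$IPMI_PASS" power on
```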
2024-07-08T20:58:42.479Z
<John Mulligan> oho
2024-07-08T20:59:21.805Z
<John Mulligan> yeah, I have seen ipmi commands fly by in the logs. I didn't think to dig up an old log file and see if I could reuse the commands in there 😐
2024-07-08T20:59:28.230Z
<Dan Mick> both up
2024-07-08T21:00:35.293Z
<John Mulligan> I haven't tried running a test with nodes pre-locked before. is teuthology smart enough to "see" that I have these nodes locked already?
2024-07-08T21:02:05.562Z
<Dan Mick> if you supply a targets yaml clause, which can either go in your job or as a separate file, and which you can generate with teuthology-lock --list-targets
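A sketch of that workflow, under the assumption that teuthology merges the yaml fragments given on the command line and that --lock can be dropped once the nodes are already held:
```
# Sketch: capture the nodes already locked to me as a targets: clause
teuthology-lock --list-targets smithi154.front.sepia.ceph.com smithi038.front.sepia.ceph.com > targets.yaml
# Rerun against those nodes; --lock is omitted since they are already locked (assumption)
teuthology -v --interactive-on-error targets.yaml reruns/2024-07-08_1.yaml
# Release them when done
teuthology-lock --unlock smithi154.front.sepia.ceph.com smithi038.front.sepia.ceph.com
```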
2024-07-08T21:02:06.667Z
<John Mulligan> (Manual ssh test worked)
2024-07-08T21:02:39.744Z
<John Mulligan> perfect.
2024-07-08T21:02:55.787Z
<Dan Mick> (and don't forget to unlock them when you're done)
2024-07-08T21:08:13.441Z
<John Mulligan> Ah! It's running now!  After messing the indent up like three times (my YAML Engineering degree needs to be revoked)
2024-07-08T21:08:39.288Z
<John Mulligan> Thanks very much for the assistance, Dan!
2024-07-08T21:08:45.013Z
<Dan Mick> np
2024-07-08T22:40:02.706Z
<John Mulligan> I think something is still up. This is what I now get:
```2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-437.el9.x86_64 is already installed.
2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-472.el9.x86_64 is already installed.
2024-07-08 21:55:27,272.272 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-07-08 21:55:27,272.272 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/phlogistonjohn/teuthology/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib64/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1236, in task
    with parallel() as p:
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 23, in capture_traceback 
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1270, in process_role 
    version = need_to_install_distro(role_remote, role_config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 761, in need_to_install_distro
    install_stdout = remote.sh(
                     ^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 97, in sh
    proc = self.run(**kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 174, in _raise_for_status
    raise ConnectionLostError(command=self.command,
teuthology.exceptions.ConnectionLostError: SSH connection to smithi116 was lost: 'sudo yum install -y kernel'
2024-07-08 21:55:27,275.275 WARNING:teuthology.run_tasks:Saw failure during task execution, going into interactive mode...
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...```
2024-07-08T22:41:51.194Z
<John Mulligan> But I'm done for the day. I'll be closing my laptop. Maybe things will work better tomorrow

Any issue? Please create an issue here and use the infra label.