ceph - sepia - 2024-07-08

Timestamp (UTC) / Message
2024-07-08T12:47:33.913Z
<Ronen Friedman> Will the real Jon Bailey please stand up?
All the jobs in my run failed with:
["The 'file' lookup had an issue accessing the file '~/.cache/src/keys/ssh/jonbailey1993.pub'. file not found, use -vvvvv to see paths searched"]
2024-07-08T13:11:50.030Z
<Adam Kraitman> I would try to run it again
2024-07-08T13:12:16.761Z
<Ronen Friedman> Just did; Failed again, with the same message
2024-07-08T13:16:29.453Z
<Adam Kraitman> Maybe we should discuss it in the next infra meeting; you can add that topic here: <https://pad.ceph.com/p/ceph-infra-weekly>
2024-07-08T13:29:03.011Z
<Adam Kraitman> Can you try running a different suite? Something looks a bit strange with that suite
2024-07-08T13:29:43.749Z
<Ronen Friedman> Sure. Initiating
2024-07-08T13:31:25.450Z
<Ronen Friedman> <https://pulpito.ceph.com/rfriedma-2024-07-08_13:30:27-rados-wip-rf-targets-j13-distro-default-smithi/>
2024-07-08T13:40:03.311Z
<Kyrylo Shatskyy> @Adam Kraitman I can add the topic
2024-07-08T13:43:41.422Z
<Ronen Friedman> Looks better. I think we're past the failure point in the previous runs.
If so - what is the problem with rados:thrash?
Thanks
2024-07-08T13:57:09.759Z
<Guillaume Abrioux> any build in shaman fails
2024-07-08T13:57:12.478Z
<Guillaume Abrioux> ```
RPM build errors:
    Signature not supported. Hash algorithm SHA1 not available.
    Signature not supported. Hash algorithm SHA1 not available.
    Bad exit status from /var/tmp/rpm-tmp.UXptz2 (%build)
Mon Jul  8 01:55:31 PM UTC 2024 :: rm -fr /tmp/install-deps.1010268```
2024-07-08T13:57:25.999Z
<Guillaume Abrioux> <https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/81020//consoleFull>
2024-07-08T13:58:40.670Z
<Guillaume Abrioux> at least for my branch, not sure if I did something wrong?
2024-07-08T14:11:00.817Z
<John Mulligan> Anyone else having issues logging into sepia machines? I was planning on doing an interactive rerun of a teuthology test I ran yesterday. However, today I can't ssh in to the teuthology vm. I appear to be on the vpn. I am on a different (office) network than I was yesterday. I also tried logging into a folio "dev playground" system and it seems to fail in the same way.
2024-07-08T14:12:37.712Z
<Ronen Friedman> I am able to ssh (Teuthology, O10, playground)
2024-07-08T14:13:51.087Z
<John Mulligan> Thanks. I was hoping it was not related to this office network. But now I'm thinking that's the main difference. I don't want to go home just to rerun this test 😕
2024-07-08T14:15:31.785Z
<Ronen Friedman> I was never able to connect to Sepia from an IBM office (only from RedHat). Per Mark Kogan - you should define 'split DNS', so that DNS queries for Sepia hosts won't go to the IBM server.
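For reference, a minimal sketch of that split-DNS setup, assuming systemd-resolved and a VPN link named tun0; the resolver address below is a placeholder, not the actual Sepia DNS server:
```
# Sketch only: send lookups for Sepia domains over the VPN link, everything else to the default resolver.
# Assumes systemd-resolved; replace tun0 and 172.21.0.1 (placeholder) with your VPN interface and the Sepia DNS server.
resolvectl dns tun0 172.21.0.1
resolvectl domain tun0 '~front.sepia.ceph.com' '~sepia.ceph.com'
# Verify that Sepia names now resolve via the VPN:
resolvectl query teuthology.front.sepia.ceph.com
```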
2024-07-08T14:16:23Z
<John Mulligan> thanks for the hint. I'll look into that
2024-07-08T14:21:34.439Z
<John Mulligan> I decided to try a different lazier way. switched to usb tethering to my phone. rejoined sepia vpn. ssh works now. **sigh lol**
2024-07-08T14:22:40.883Z
<Ronen Friedman> Yes. I too found that easier...
2024-07-08T18:25:07.663Z
<yuriw> Do we have any DNS resolution problems now?
I can't ping `github.com` from `teuthology` or `vossi02`
2024-07-08T18:57:25.239Z
<Dan Mick> something is happening on one of the builders that I don't yet understand, so, maybe connected
2024-07-08T19:00:44.292Z
<Dan Mick> There may be an external network routing issue at the moment; some other community-cage members are reporting issues
2024-07-08T19:01:06.264Z
<Dan Mick> yeah, it's not DNS specifically; I can't ping 8.8.8.8
2024-07-08T19:03:47.200Z
<Dan Mick> ...and now it seems back.
2024-07-08T19:04:08.585Z
<Dan Mick> @yuriw try again and lmk
2024-07-08T19:05:20.803Z
<Dan Mick> the key is usually "use this connection only for this network" or whatever it shows up as in your VPN setup thing
2024-07-08T19:05:46.970Z
<yuriw> I am good, thx @Dan Mick
2024-07-08T19:06:54.386Z
<John Mulligan> cool, thanks.  I got into the teuthology vm and into folio02. But none of my interactive reruns were working. I'm probably being a blockhead and don't remember how to do it right, but none of the smithi machines worked for ssh (from folio02 -> smithiX, that is)
2024-07-08T19:07:36.597Z
<John Mulligan> I'll pester Adam King about it tomorrow 🙂
2024-07-08T19:19:28.426Z
<Dan Mick> do you have agent-forwarding on in your ssh client?  Your ssh creds are local to your originating box, and need to be forwarded for multiple hops to work
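As an aside, a minimal example of what enabling agent forwarding for the extra hop can look like; the folio02 hostname and the host pattern are assumptions:
```
# One-off: forward the local ssh agent so the folio02 -> smithiNNN hop can reuse your key
ssh -A phlogistonjohn@folio02.front.sepia.ceph.com

# Or persistently, in ~/.ssh/config on the originating box:
#   Host *.front.sepia.ceph.com
#       ForwardAgent yes
```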
2024-07-08T19:19:39.333Z
<Dan Mick> <https://github.com/ceph/ceph-build/pull/2265> when you get a chance
2024-07-08T19:22:09.607Z
<John Mulligan> I will double check but I ran it with `ssh -vv` and it was not making a connection to the host ip yet. (No route to host errors)
2024-07-08T19:23:18.960Z
<Dan Mick> folio02 routes to smithi001 ok now
2024-07-08T19:50:13.682Z
<nehaojha> @Dan Mick this ceph.io PR build failure <https://jenkins.ceph.com/job/ceph-website-prs/951/console> looks like a network issue
```stderr: fatal: unable to access 'https://github.com/ceph/ceph.io/': Could not resolve host: github.com```
2024-07-08T19:50:23.185Z
<nehaojha> also <https://jenkins.ceph.com/job/ceph-pull-requests/138457/console> on a ceph PR
2024-07-08T20:03:29.170Z
<Dan Mick> yes, as noted elsewhere the external net had a failure period today
2024-07-08T20:26:06.264Z
<John Mulligan> Tried again right before driving home. Same error:
`ssh -vv smithi154.front.sepia.ceph.com` ...
```debug2: match found
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug2: resolving "smithi154.front.sepia.ceph.com" port 22
debug1: Connecting to smithi154.front.sepia.ceph.com [172.21.15.154] port 22.
debug1: connect to address 172.21.15.154 port 22: No route to host
ssh: connect to host smithi154.front.sepia.ceph.com port 22: No route to host```
2024-07-08T20:26:38.883Z
<John Mulligan> FWIW:
```(virtualenv) [phlogistonjohn@folio02 ~]$ teuthology-lock --list
/home/phlogistonjohn/teuthology/teuthology/lock/cli.py:136: SyntaxWarning: invalid escape sequence '\w'
  mo = re.match('\w+@(\w+?)\..*', s['name'])
[
    {
        "name": "[smithi154.front.sepia.ceph.com](http://smithi154.front.sepia.ceph.com)",
        "description": null,
        "up": true,
        "machine_type": "smithi",
        "is_vm": false,
        "vm_host": null,
        "os_type": "centos",
        "os_version": "9.stream",
        "arch": "x86_64",
        "locked": true,
        "locked_since": "2024-07-08 19:28:58.032937",
        "locked_by": "phlogistonjohn@folio02",
        "mac_address": null,
        "ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBPgcqUXfQ2+WhIkbWKdGF+KxzG9XQPRTup1M1La6ey6+TmPQcLasIs+agoRkBvj9ViPltjaXe12lRJ4ZDhuyexM="
    },
    {
        "name": "[smithi038.front.sepia.ceph.com](http://smithi038.front.sepia.ceph.com)",
        "description": null,
        "up": true,
        "machine_type": "smithi",
        "is_vm": false,
        "vm_host": null,
        "os_type": "centos",
        "os_version": "9.stream",
        "arch": "x86_64",
        "locked": true,
        "locked_since": "2024-07-08 19:28:58.034304",
        "locked_by": "phlogistonjohn@folio02",
        "mac_address": null,
        "ssh_pub_key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBLKiM9GEL0R07TyQKIIhg/p8xDYq9vYQEz2+F+11AD3YklWZToCnTxZA8EFMJMQIHlB5ouIN3p13CZeQQiUH/78="
    }
]```
2024-07-08T20:28:20.415Z
<Dan Mick> are you talking about the warning?
2024-07-08T20:30:43.643Z
<John Mulligan> No, I have attempted to rerun some failed tests from Sunday. But every time I run teuthology (not teuthology-suite), it locks hosts and then fails to ssh into them. See the ssh -vv debug output in the previous paste. It's not worked once for me today. I am possibly doing something wrong, but according to my notes from the last time I did this, I'm not.
2024-07-08T20:31:15.960Z
<John Mulligan> I pasted the teuthology-lock results in case you wanted to see what hosts I was trying with just now.
2024-07-08T20:38:45.846Z
<Dan Mick> ok I'm reproducing the ssh/no route for smithi154
2024-07-08T20:39:14.920Z
<John Mulligan> ok, glad it's not just me!
2024-07-08T20:39:39.634Z
<Dan Mick> it's powered off
2024-07-08T20:40:16.904Z
<John Mulligan> To quote Mr Urkel, "did I do that?"
2024-07-08T20:40:28.334Z
<Dan Mick> so is 038
2024-07-08T20:40:30.259Z
<Dan Mick> I don't know
2024-07-08T20:40:55.674Z
<Dan Mick> when were the nodes locked last?
2024-07-08T20:41:48.157Z
<John Mulligan> ok. I just followed the process I used last time according to my notes, but this time from folio02 (per the request that we do more stuff using the dev playgrounds)
2024-07-08T20:42:15.193Z
<John Mulligan> not sure what you mean by "locked last"?
2024-07-08T20:42:48.522Z
<Dan Mick> in order to use a teuthology node it must be locked.  This can be explicit per-host or done as part of a job submission.
2024-07-08T20:43:22.350Z
<Dan Mick> teuthology-lock shows the lock status of those hosts. I'm suggesting that something went wrong in the locking process that left them powered down
2024-07-08T20:43:24.440Z
<John Mulligan> OK, I ran `teuthology -v --lock --block --interactive-on-error reruns/2024-07-08_1.yaml`
2024-07-08T20:43:45.889Z
<Dan Mick> did that apparently succeed?
2024-07-08T20:44:13.249Z
<Dan Mick> er, did the locking part of that apparently succeed, I guess I should say
2024-07-08T20:45:05.025Z
<John Mulligan> it failed. there was a traceback in a teuthology method and it dropped me to a prompt, and I then killed the process to (attempt to) debug it
2024-07-08T20:45:20.897Z
<John Mulligan> I can't find the exact error at the moment
2024-07-08T20:46:21.002Z
<Dan Mick> ok.   hard to say; might have been some stupidity resulting from the network randomness
2024-07-08T20:46:52.298Z
<John Mulligan> that's what I was wondering when I heard that earlier, but I don't know how to get the nodes out of this state.
2024-07-08T20:47:01.719Z
<John Mulligan> they seem "stuck this way" for me
2024-07-08T20:54:07.397Z
<Dan Mick> the dumbest way would be to unlock and then lock two others
2024-07-08T20:54:23.206Z
<Dan Mick> but if there's a chance they're in the right state you can power them on and verify that
2024-07-08T20:54:26.003Z
<Dan Mick> or I can
2024-07-08T20:55:11.534Z
<Dan Mick> I have done so
2024-07-08T20:55:33.937Z
<Dan Mick> let them get to sshability and then see if your teuth run will run
2024-07-08T20:55:55.802Z
<John Mulligan> ok....
2024-07-08T20:56:13.578Z
<John Mulligan> do I have access to do what you did? If so, how?
2024-07-08T20:56:41.343Z
<Dan Mick> (and let me just say again for the record that "no route to host" is the **stupidest** error message for "can't establish connection" ever...it has nothing to do with routing at all.  I'm sure the ether will appreciate hearing my cries yet again)
2024-07-08T20:57:02.428Z
<Dan Mick> I don't remember.  It's ipmi
2024-07-08T20:57:27.688Z
<John Mulligan> ah
2024-07-08T20:57:40.174Z
<Dan Mick> <http://wiki.front.sepia.ceph.com/doku.php?id=testnodeaccess&s[]=ipmi#ipmi>
2024-07-08T20:57:48.813Z
<Dan Mick> I don't recall if we make that password generally available.  probably not.
2024-07-08T20:57:55.227Z
<Dan Mick> it's easy to make a mistake and cause havoc
2024-07-08T20:58:37.082Z
<Dan Mick> ....although it's in teuthology.yaml so I guess so
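For the record, a rough sketch of the kind of IPMI check and power-on involved; the BMC hostname pattern and the credential variables are assumptions, with the real values living in teuthology.yaml and on the wiki page above:
```
# Sketch only: check and restore power on a test node via its BMC.
# Hostname pattern and $IPMI_USER/$IPMI_PASS are assumptions; take the real values from teuthology.yaml.
ipmitool -I lanplus -H smithi154.ipmi.sepia.ceph.com -U "$IPMI_USER" -P "$IPMI_PASS" power status
ipmitool -I lanplus -H smithi154.ipmi.sepia.ceph.com -U "$IPMI_USER" -P "$IPMI_PASS" power on
```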
2024-07-08T20:58:42.479Z
<John Mulligan> oho
2024-07-08T20:59:21.805Z
<John Mulligan> yeah, I have seen ipmi commands fly by in the logs. I didn't think to dig up an old log file and see if I could reuse the commands in there 😐
2024-07-08T20:59:28.230Z
<Dan Mick> both up
2024-07-08T21:00:35.293Z
<John Mulligan> I haven't tried running a test with nodes pre-locked before. is teuthology smart enough to "see" that I have these nodes locked already?
2024-07-08T21:02:05.562Z
<Dan Mick> if you supply a targets yaml clause, which can either go in your job or as a separate file, and which you can generate with teuthology-lock --list-targets
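A sketch of that workflow, under the assumption that teuthology merges the yaml fragments given on the command line and that --lock can be dropped once the nodes are already held:
```
# Sketch: capture the nodes already locked to me as a targets: clause
teuthology-lock --list-targets smithi154.front.sepia.ceph.com smithi038.front.sepia.ceph.com > targets.yaml
# Rerun against those nodes; --lock is omitted since they are already locked (assumption)
teuthology -v --interactive-on-error targets.yaml reruns/2024-07-08_1.yaml
# Release them when done
teuthology-lock --unlock smithi154.front.sepia.ceph.com smithi038.front.sepia.ceph.com
```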
2024-07-08T21:02:06.667Z
<John Mulligan> (Manual ssh test worked)
2024-07-08T21:02:39.744Z
<John Mulligan> perfect.
2024-07-08T21:02:55.787Z
<Dan Mick> (and don't forget to unlock them when you're done)
2024-07-08T21:08:13.441Z
<John Mulligan> Ah! It's running now!  After messing the indent up like three times (my YAML Engineering degree needs to be revoked)
2024-07-08T21:08:39.288Z
<John Mulligan> Thanks very much for the assistance, Dan!
2024-07-08T21:08:45.013Z
<Dan Mick> np
2024-07-08T22:40:02.706Z
<John Mulligan> I think something is still up. This is what I now get:
```2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-437.el9.x86_64 is already installed.
2024-07-08 21:55:27,254.254 INFO:teuthology.orchestra.run.smithi116.stdout:Package kernel-5.14.0-472.el9.x86_64 is already installed.
2024-07-08 21:55:27,272.272 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-07-08 21:55:27,272.272 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/phlogistonjohn/teuthology/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib64/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1236, in task
    with parallel() as p:
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/phlogistonjohn/teuthology/teuthology/parallel.py", line 23, in capture_traceback 
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 1270, in process_role 
    version = need_to_install_distro(role_remote, role_config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/task/kernel.py", line 761, in need_to_install_distro
    install_stdout = remote.sh(
                     ^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 97, in sh
    proc = self.run(**kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/phlogistonjohn/teuthology/teuthology/orchestra/run.py", line 174, in _raise_for_status
    raise ConnectionLostError(command=self.command,
teuthology.exceptions.ConnectionLostError: SSH connection to smithi116 was lost: 'sudo yum install -y kernel'
2024-07-08 21:55:27,275.275 WARNING:teuthology.run_tasks:Saw failure during task execution, going into interactive mode...
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...```
2024-07-08T22:41:51.194Z
<John Mulligan> But I'm done for the day. I'll be closing my laptop. Maybe things will work better tomorrow

Any issue? Please create an issue here and use the infra label.