ceph - sepia - 2024-07-11

2024-07-11T19:45:39.873Z
<Laura Flores> The main nightlies for the rados suite are consistently failing with:
```Saw error while trying to spawn supervisor.```
Here's an example: <https://pulpito.ceph.com/teuthology-2024-05-12_20:00:20-rados-main-distro-default-smithi/>

There are no logs available under the jobs except for "orig.config.yaml", which isn't very helpful. @Zack Cerza do you have any idea what's going on?
2024-07-11T19:58:17.014Z
<Zack Cerza> that run's two months old; is there a newer example?
2024-07-11T20:00:01.716Z
<Zack Cerza> the handful of recent rados runs I just looked at didn't have this problem
2024-07-11T20:11:23.705Z
<Laura Flores> That's the latest example, which I understand is old at this point. Looking at more nightly runs for the rados suite, it seems that runs are waiting in the queue for long periods of time before actually running.

Overall link: <https://pulpito.ceph.com/?suite=rados&branch=main>

Here's another one from the end of June that's still queued: <https://pulpito.ceph.com/teuthology-2024-06-30_20:00:17-rados-main-distro-default-smithi/>
2024-07-11T20:12:08.010Z
<Laura Flores> My hunch is that jobs are waiting in the queue for too long and then just dying. Would you recommend I raise the priority of nightly runs? Or would that be too much for the queue? (each rados run is about 300 jobs)
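
For reference, priority is set when a run is scheduled; a minimal, illustrative invocation might look like the following, assuming the usual rados-on-smithi nightly shape (the priority value shown is an assumption, not the actual nightly configuration; lower numbers are generally dispatched sooner):
```
# Illustrative only: schedule a rados run against main on smithi at an
# explicit priority. Lower priority numbers are generally picked up sooner.
# The value 70 is an assumption, not the real nightly setting.
teuthology-suite --suite rados --ceph main --machine-type smithi --priority 70
```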
2024-07-11T20:30:47.028Z
<Zack Cerza> We're incredibly backed up lately, and to me it seems like we are scheduling a _lot_ more jobs that end up hitting the twelve hour timeout. But none of that should cause a complete failure to simply spawn a process. I wouldn't be surprised if this happened during an OOM event. I feel like I remember one happening around then
2024-07-11T20:32:23.403Z
<Zack Cerza> Re: the queue size, I'm finishing up work on a feature that will allow us to specify expiration dates for runs. We'll also be able to have a global maximum time-since-scheduled age for jobs. Expired jobs will be skipped by the dispatcher
2024-07-11T20:33:08.062Z
<Zack Cerza> (this came from @Patrick Donnelly’s "deadline" idea)
2024-07-11T20:34:19.674Z
<Laura Flores> Gotcha
2024-07-11T20:35:19.636Z
<Laura Flores> Since we're so backed up, I think the best option for rados, it being such a big suite, is to divide it into sub-suites that run at a higher priority. We don't need a full rados suite every single week, but we're not even getting reliable results on a monthly basis.
2024-07-11T20:35:54.702Z
<Laura Flores> If I look into that, does that sound reasonable?
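
For illustration, splitting could mean scheduling individual sub-suites (e.g. a rados subdirectory) or a fraction of the full matrix via --subset. A hedged sketch, where the sub-suite name, subset fraction, and priority are assumptions rather than a proposed schedule:
```
# Illustrative sketch only; names and numbers below are assumptions.
# Schedule a single rados sub-suite at a higher priority:
teuthology-suite --suite rados:thrash --ceph main --machine-type smithi --priority 70
# Or schedule a fraction of the full rados matrix:
teuthology-suite --suite rados --subset 1/9 --ceph main --machine-type smithi --priority 70
```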
2024-07-11T20:41:37.887Z
<Dan Mick> Yeah.  Sulcata02 had a bunch of kernel core dumps chewing up gigs.  I've taken it offline.
2024-07-11T20:54:30.531Z
<Zack Cerza> Yeah I think that's a good idea
