2025-01-08T05:41:52.995Z | <jcollin> @Venky Shankar Regarding your comment on changing cout to derr, I think this kind of search is needed <https://github.com/ceph/ceph/pull/44710/commits/a1a916547edd42a2961c9bd614b1a0409062c178>. |
2025-01-08T05:42:19.947Z | <Venky Shankar> yeh |
2025-01-08T05:42:24.782Z | <Venky Shankar> I think, yes |
2025-01-08T05:42:36.079Z | <Venky Shankar> but then, we need derr to maintain consistency of the way we log |
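For context, Ceph's convention is to send daemon/client messages through its logging macros (`dout(level)` for normal output, `derr` for errors, both terminated with `dendl`) rather than `std::cout`, so the output lands in the configured log. Below is a minimal standalone sketch of the two styles; the macros here are simplified local stand-ins, not Ceph's real ones from `common/debug.h`.

```cpp
#include <iostream>

// Simplified stand-ins for Ceph's logging macros; the real derr/dendl live in
// common/debug.h and write to the subsystem log rather than plain stderr.
#define derr  std::cerr << "ERROR: "
#define dendl std::endl

int main() {
  // Old style: prints to stdout and bypasses the log infrastructure.
  std::cout << "journal scan failed" << std::endl;

  // Preferred style: the same message, routed through the error log macro.
  derr << "journal scan failed" << dendl;
  return 0;
}
```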
2025-01-08T05:43:24.514Z | <jcollin> ok |
2025-01-08T05:43:51.665Z | <jcollin> let me check |
2025-01-08T06:22:16.864Z | <jcollin> in vstart it logs into client.admin logs |
2025-01-08T06:22:22.840Z | <jcollin> @Venky Shankar |
2025-01-08T06:31:58.884Z | <Venky Shankar> kk |
2025-01-08T10:20:33.715Z | <RBecker> hi there, anyone around? facing a problem with our cephfs, mds are getting stuck in up:replay but the journal read and write positions are 0 |
2025-01-08T10:23:49.917Z | <Venky Shankar> Looks to me that the journal header got reset somehow? Could you explain how you ran into this? (running recovery tools, etc..)? |
2025-01-08T10:25:34.546Z | <RBecker> we haven't run any sort of recovery tools, but we did get a health warn stating that two clients were slow to respond to capability release. that then turned into slow metadata IOs blocked |
2025-01-08T10:25:52.572Z | <RBecker> any sort of cephfs-journal-tool (even to view/inspect) hangs indefinitely |
2025-01-08T10:26:38.585Z | <Venky Shankar> If the read/write positions are incorrect, then those tools are definitely going to bail out... |
2025-01-08T10:27:12.107Z | <Venky Shankar> I don't understand how slow release of caps by clients can turn into garbage r/w journal positions -- that's unheard of. |
2025-01-08T10:28:17.629Z | <RBecker> we did restart the clients reporting those issues and i believe the other admin did try restarting an MDS service. in any case, i'm less concerned about recovering prior journal data than about at least being able to start fresh and get things working again |
2025-01-08T10:29:04.597Z | <Venky Shankar> tried resetting the journal? |
2025-01-08T10:29:50.437Z | <RBecker> what's the command for that? is it cephfs-journal-tool journal reset? |
2025-01-08T10:30:38.842Z | <Venky Shankar> yes |
2025-01-08T10:30:42.582Z | <RBecker> that also hangs |
2025-01-08T10:31:19.771Z | <RBecker> i assume that doesn't need to be run on a specific node in the cluster, just one with access to the cluster itself |
2025-01-08T10:36:20.262Z | <Venky Shankar> yeh |
2025-01-08T10:37:46.900Z | <RBecker> unless it just needs a _really_ long time to complete, but that doesn't seem to be the case either |
2025-01-08T10:38:01.248Z | <Venky Shankar> ok, looking at journal reset -- it reads the read/write pos and tries to round that up to the closest position based on the object layout. |
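To make the rounding concrete: the MDS journal is striped over fixed-size RADOS objects (4 MiB by default, assumed here), so a raw position is rounded up to the next object boundary. A rough sketch of that idea, not the actual Journaler code:

```cpp
#include <cstdint>
#include <iostream>

// Round a journal position up to the next object boundary.
// The 4 MiB object size is an assumption; the real value comes from the
// journal's file layout.
uint64_t round_up_to_object(uint64_t pos, uint64_t object_size = 4ull << 20) {
  return ((pos + object_size - 1) / object_size) * object_size;
}

int main() {
  std::cout << round_up_to_object(0) << "\n";        // 0 -- a zeroed header stays at 0
  std::cout << round_up_to_object(4200000) << "\n";  // 8388608, i.e. 2 * 4 MiB
  return 0;
}
```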
2025-01-08T10:38:43.276Z | <Venky Shankar> can you try running the command with `--debug_client=20` to see if that dumps any log showing what's going on? It should execute pretty fast.. |
2025-01-08T10:38:57.574Z | <RBecker> sure |
2025-01-08T10:39:54.328Z | <RBecker> `cephfs-journal-tool --rank=tcfs:0 journal reset --debug_client=20` - no output at all |
2025-01-08T10:42:18.237Z | <Venky Shankar> .. and I assume other ceph commands are working? |
2025-01-08T10:42:30.783Z | <RBecker> yep, i can do ceph -s, ceph health detail, ceph fs status, etc |
2025-01-08T10:45:01.895Z | <RBecker> is it possible there's an underlying rados issue somewhere? we're starting to see problems with some things accessing rgw images, and i've seen elsewhere in the docs that rados issues can cause cephfs issues, but no real info on how to troubleshoot beyond that |
2025-01-08T10:50:04.828Z | <Venky Shankar> it's possible, but some commands are working, so at least I'd expect the tool to be able to connect to the monitor. |
2025-01-08T10:50:36.770Z | /me <RBecker> nods |
2025-01-08T10:52:21.530Z | <RBecker> to note, we do have two cephfs pools, and both are being affected in the same way |
2025-01-08T13:28:15.125Z | <jcollin> @Venky Shankar I'll send status by email today. Can't talk, bad throat. |
2025-01-08T13:30:06.290Z | <Venky Shankar> noted. |
2025-01-08T15:57:33.460Z | <Austin Axworthy> Running into an issue with an MDS journal replay. We have two active MDS stuck in the replay state. The journals are very large and the current ETA for replay is about 30+ hours. The read position made significant progress up front, but has slowed to a crawl. Per the docs, disabling debug logs should help increase the replay speed. Wondering what other options there are to optimize the replay speed.
The cluster was healthy when the MDS failed over, but has since been falling behind on trimming.
`mds.0`
`"wrpos": 462153968414910,`
`"rdpos": 460617981984433,`
`mds.1`
`"wrpos": 20111573034462,`
`"rdpos": 19147055060473,` |
2025-01-08T17:16:29.598Z | <Andrea Bolzonella> add `--log_to_stderr true` |