ceph - ceph-devel - 2024-10-14

Timestamp (UTC), Message
2024-10-14T00:44:58.118Z
<Alexander Patrakov> We hit the same loop when scrubbing a specific subdirectory
2024-10-14T07:36:35.098Z
<Milind Changire> for [this windows build](https://jenkins.ceph.com/job/ceph-windows-pull-requests/48253/consoleFull#212386344640526d21-3511-427d-909c-dd086c0d1034) in Jenkins, I get the following error:
```ld.lld: error: unable to automatically import from global_thread_id with relocation type IMAGE_REL_AMD64_SECREL in src/librados/CMakeFiles/librados.dir/librados_c.cc.obj```
Other 64-bit Jenkins builds seem to build fine.
Does anybody have any recommendations for fixing this error?
2024-10-14T11:02:53.418Z
<rzarzynski> Hi @yuriw and @Laura Flores! What's the status of <https://github.com/ceph/ceph/pull/60158>? I bet it's already prioritized, but I want to check after the Sepia outage.
2024-10-14T16:14:12.511Z
<Adam Kupczyk> @rzarzynski <https://github.com/ceph/ceph/pull/60158#issuecomment-2411692127> The PR checks out as much as can be tested.
2024-10-14T18:41:42.031Z
<ljon> I am writing data to a Ceph RBD image using the librbd C++ library or the go-ceph library; let us use go-ceph as the example. I found that if an OSD or MON is unresponsive, RBD write, read, and other operations hang forever. Is there any way to make them time out so that my application can continue?
For example, if an RBD write timed out, I assume it would return something like ETIMEDOUT; my application could then catch this error, close the image handle, and do other error handling.
The reason I want to call the close function when a write hangs is that I can at least release the handle, so that the image can be changed (written, removed, snapshotted, etc.) by other threads.
I understand that various conditions could cause write() to hang, for example a network connection issue or a full OSD. When answering, could you please pick several of these situations as examples and explain them? Please also comment on whether calling close() is a good solution in these situations (and similar ones, such as a hanging read() or snap purge).
Thanks for help

2024-10-14T18:56:35.557Z
<ljon> I am writing data to a Ceph RBD image using the librbd C++ library or the go-ceph library; let us use go-ceph as the example. I found that if an OSD or MON is unresponsive, or Ceph is in HEALTH_ERR, RBD write, read, and other operations hang forever. Is there any way to make them time out so that my application can continue?

For example, if an RBD write timed out, I assume it would return something like -ETIMEDOUT; my application could then catch this error, close the image handle, and do other error handling.

So what I would like to achieve in my application is this: no RBD operation should hang forever. Every RBD operation should have a timeout, and if the operation times out, an error should be returned.

My question is: how can I achieve that?

My current code is able to connect to Ceph and to create, export, and import RBD images using the go-ceph Go library. The missing part is the error handling when an RBD operation hangs.

Thanks for help

Update:
My current solution, from simple trial and error, is:
use the `func (c *Conn) SetConfigOption(option, value string) error` function in the go-ceph library to set per-connection configuration.
I found three related Ceph settings that might be helpful:
`client_mount_timeout`,
`rados_mon_op_timeout`,
`rados_osd_op_timeout`

Is what I am doing the right direction for keeping RBD operations from hanging forever? If so, are there any other settings you would recommend? Please also comment on anything else you can think of on this topic.

Thanks.

One reference I found: <https://listman.redhat.com/archives/libvir-list/2014-February/msg01523.html>
2024-10-14T19:24:14.311Z
<gregsfortytwo> I believe you have found the right config options @ljon. Be aware that these are not well tested
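A minimal Go sketch of the approach discussed above: setting the three options through a `SetConfigOption` method with the signature quoted in the question. To keep it runnable without a cluster, it is written against a small interface; `configSetter`, `applyOpTimeouts`, `recordingSetter`, and the 30-second value are hypothetical names and choices, not part of go-ceph. It assumes the options accept a whole number of seconds and that they are set before the connection is opened.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// configSetter is the only method this sketch relies on. The go-ceph
// connection type quoted in the thread exposes a method with this signature.
type configSetter interface {
	SetConfigOption(option, value string) error
}

// applyOpTimeouts (hypothetical helper) sets the three client-side timeout
// options named in the thread to the same value, in whole seconds.
// Assumption: it is called before the connection to the cluster is opened.
func applyOpTimeouts(c configSetter, timeout time.Duration) error {
	secs := int(timeout / time.Second)
	if secs < 1 {
		return fmt.Errorf("timeout %v is shorter than one second", timeout)
	}
	value := strconv.Itoa(secs)
	options := []string{
		"client_mount_timeout",
		"rados_mon_op_timeout",
		"rados_osd_op_timeout",
	}
	for _, option := range options {
		if err := c.SetConfigOption(option, value); err != nil {
			return fmt.Errorf("setting %s: %w", option, err)
		}
	}
	return nil
}

// recordingSetter is a stand-in for a real connection; it only records
// the options it is given so the sketch can run without a cluster.
type recordingSetter map[string]string

func (r recordingSetter) SetConfigOption(option, value string) error {
	r[option] = value
	return nil
}

func main() {
	fake := recordingSetter{}
	if err := applyOpTimeouts(fake, 30*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fake)
}
```

In a real program the fake would be replaced by the go-ceph connection value. As noted in the reply above, these options are not well tested, so the resulting error paths are worth exercising against a deliberately degraded test cluster.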
2024-10-14T19:25:39.506Z
<John Mulligan> If you dig through the go-ceph issues, you will find others using those same timeout values to accomplish similar goals.
2024-10-14T19:27:38.264Z
<John Mulligan> One other technique you can consider is using separate goroutine(s) to handle Ceph-related operations and another goroutine to monitor how long the Ceph op goroutine has taken.
Note that this approach is much more complex: error handling will be tricky, and so may terminating the connection to Ceph. But I do think it's viable, depending on the architecture of your application.
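The goroutine-based technique described above could look roughly like the following sketch, which uses only the Go standard library. `runWithTimeout` and `errOpTimeout` are hypothetical names, and the sleeping function stands in for a blocking Ceph call. The caller stops waiting after the timeout, but the blocked operation itself is not cancelled, which is part of why cleanup is tricky.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errOpTimeout is a hypothetical sentinel error for this sketch.
var errOpTimeout = errors.New("ceph operation timed out")

// runWithTimeout runs op in its own goroutine and waits at most timeout
// for it to finish. On timeout the goroutine is NOT cancelled: a blocked
// call keeps running in the background and may still complete later.
func runWithTimeout(timeout time.Duration, op func() error) error {
	// Buffered so that a late-finishing op can send its result and exit
	// even though nobody is receiving any more.
	done := make(chan error, 1)
	go func() { done <- op() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errOpTimeout
	}
}

func main() {
	// Stand-in for a blocking call such as an image write.
	slow := func() error {
		time.Sleep(200 * time.Millisecond)
		return nil
	}
	fast := func() error { return nil }

	fmt.Println("slow op:", runWithTimeout(50*time.Millisecond, slow))
	fmt.Println("fast op:", runWithTimeout(50*time.Millisecond, fast))
}
```

Because the timed-out goroutine may still hold the image handle, closing or reusing that handle from the caller right after a timeout is not obviously safe; that decision depends on the application's architecture, as the message above points out.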
