ceph - cephfs - 2024-06-12

Timestamp (UTC) Message
2024-06-12T01:56:15.447Z
<Xiubo Li> @Patrick Donnelly BTW, do you know which mds option could make the dir fragments get balanced and migrated to other MDSs quickly?
2024-06-12T02:02:03.313Z
<Xiubo Li> @Patrick Donnelly @Venky Shankar BTW, do you know which mds option could make the dir fragments get balanced and migrated to other MDSs quickly?
2024-06-12T04:06:26.735Z
<Jos Collin> this doesn't seem an issue with mirroring

```<error>
  <unique>0x2bb</unique>
  <tid>1</tid>
  <kind>Leak_StillReachable</kind>
  <xwhat>
    <text>36 bytes in 1 blocks are still reachable in loss record 3 of 11</text>
    <leakedbytes>36</leakedbytes>
    <leakedblocks>1</leakedblocks>
  </xwhat>
  <stack>
    <frame>
      <ip>0x484480F</ip>
      <obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
      <fn>malloc</fn>
      <dir>/builddir/build/BUILD/valgrind-3.22.0/coregrind/m_replacemalloc</dir>
      <file>vg_replace_malloc.c</file>
      <line>442</line>
    </frame>
    <frame>
      <ip>0x402382F</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>malloc</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/string/../include</dir>
      <file>rtld-malloc.h</file>
      <line>56</line>
    </frame>
    <frame>
      <ip>0x402382F</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>strdup</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/string</dir>
      <file>strdup.c</file>
      <line>42</line>
    </frame>
    <frame>
      <ip>0x4014677</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>_dl_load_cache_lookup</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-cache.c</file>
      <line>517</line>
    </frame>
    <frame>
      <ip>0x40089B7</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>_dl_map_object</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-load.c</file>
      <line>2152</line>
    </frame>
    <frame>
      <ip>0x400C3B9</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>dl_open_worker_begin</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-open.c</file>
      <line>577</line>
    </frame>
    <frame>
      <ip>0x5B8F147</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>_dl_catch_exception</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-error-skeleton.c</file>
      <line>208</line>
    </frame>
    <frame>
      <ip>0x400BAF9</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>dl_open_worker</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-open.c</file>
      <line>796</line>
    </frame>
    <frame>
      <ip>0x5B8F147</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>_dl_catch_exception</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-error-skeleton.c</file>
      <line>208</line>
    </frame>
    <frame>
      <ip>0x400BF5E</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>_dl_open</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-open.c</file>
      <line>898</line>
    </frame>
    <frame>
      <ip>0x5ABECBB</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>dlopen_doit</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/dlfcn</dir>
      <file>dlopen.c</file>
      <line>56</line>
    </frame>
    <frame>
      <ip>0x5B8F147</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>_dl_catch_exception</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-error-skeleton.c</file>
      <line>208</line>
    </frame>
    <frame>
      <ip>0x5B8F212</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>_dl_catch_error</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-error-skeleton.c</file>
      <line>227</line>
    </frame>
    <frame>
      <ip>0x5ABE78D</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>_dlerror_run</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/dlfcn</dir>
      <file>dlerror.c</file>
      <line>138</line>
    </frame>
    <frame>
      <ip>0x5ABED70</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>dlopen_implementation</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/dlfcn</dir>
      <file>dlopen.c</file>
      <line>71</line>
    </frame>
    <frame>
      <ip>0x5ABED70</ip>
      <obj>/usr/lib64/libc.so.6</obj>
      <fn>dlopen@@GLIBC_2.34</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/dlfcn</dir>
      <file>dlopen.c</file>
      <line>81</line>
    </frame>
    <frame>
      <ip>0x5039C0F</ip>
      <obj>/usr/lib64/ceph/libceph-common.so.2</obj>
      <fn>_sub_I_65535_0.0</fn>
    </frame>
    <frame>
      <ip>0x400507D</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>call_init</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-init.c</file>
      <line>70</line>
    </frame>
    <frame>
      <ip>0x400507D</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>call_init</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-init.c</file>
      <line>26</line>
    </frame>
    <frame>
      <ip>0x400516B</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
      <fn>_dl_init</fn>
      <dir>/usr/src/debug/glibc-2.34-105.el9.x86_64/elf</dir>
      <file>dl-init.c</file>
      <line>117</line>
    </frame>
    <frame>
      <ip>0x401CC29</ip>
      <obj>/usr/lib64/ld-linux-x86-64.so.2</obj>
    </frame>
    <frame>
      <ip>0x4</ip>
    </frame>
    <frame>
      <ip>0x1FFF000B82</ip>
    </frame>
    <frame>
      <ip>0x1FFF000B90</ip>
    </frame>
    <frame>
      <ip>0x1FFF000B9A</ip>
    </frame>
    <frame>
      <ip>0x1FFF000B9F</ip>
    </frame>
    <frame>
      <ip>0x1FFF000BA4</ip>
    </frame>
  </stack>
</error>

</valgrindoutput>```
2024-06-12T04:06:48.078Z
<Jos Collin> @Venky Shankar
2024-06-12T05:37:22.900Z
<Venky Shankar> replication factor?
2024-06-12T05:37:33.016Z
<Venky Shankar> `mds_bal_replicate_threshold`
2024-06-12T05:37:57.327Z
<Venky Shankar> lowering this causes the subtree to be replicated much more frequently
2024-06-12T05:38:27.735Z
<Xiubo Li> Okay, I will try it. BTW, is any load needed to trigger it with this?
2024-06-12T05:39:50.416Z
<Venky Shankar> nothing specific I think, but some workunit from fs:workload would suffice
2024-06-12T05:40:35.762Z
<Xiubo Li> Sure, let me have a try. thanks very much @Venky Shankar
2024-06-12T05:58:56.665Z
<Rishabh Dave> @Venky Shankar In the context of <https://github.com/ceph/ceph/pull/54620#discussion_r1634309624>, `get_ceph_shell_stdout` is needed for tests in this PR. Can we keep it? There are similar methods, but for running Ceph commands (`raw_cluster_cmd()`, `get_ceph_cmd_stdout()`)
2024-06-12T07:00:04.269Z
<Rishabh Dave> 2nd question -
I am fine with the current changes you quoted here - <https://github.com/ceph/ceph/pull/54620#discussion_r1631026357>. The average of percentages and the percentage of averages are the same IMO. Let me know if you are not fine with the changes here and I need to replace it.
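Whether an average of percentages matches a percentage of averages depends on the denominators: the two agree when every percentage is taken over the same whole, and can differ otherwise. A minimal sketch with made-up numbers follows. The function names and sample values are hypothetical and are not taken from the PR under discussion.

```python
def average_of_percentages(pairs):
    """Mean of the per-item percentages; every item counts equally."""
    return sum(100.0 * part / whole for part, whole in pairs) / len(pairs)


def percentage_of_averages(pairs):
    """Percentage computed from the averaged parts over the averaged wholes."""
    mean_part = sum(part for part, _ in pairs) / len(pairs)
    mean_whole = sum(whole for _, whole in pairs) / len(pairs)
    return 100.0 * mean_part / mean_whole


# Hypothetical (part, whole) samples.
same_whole = [(10, 100), (30, 100), (50, 100)]   # identical denominators
mixed_whole = [(1, 10), (90, 100)]               # different denominators

for name, data in (("same_whole", same_whole), ("mixed_whole", mixed_whole)):
    print(name, average_of_percentages(data), percentage_of_averages(data))
```

With identical denominators both functions return the same value; with mixed denominators the larger sample dominates the percentage of averages, so the results diverge.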
2024-06-12T07:00:57.926Z
<Venky Shankar> I'm fine with that too
2024-06-12T07:01:04.069Z
<Venky Shankar> but just comment it out
2024-06-12T07:01:25.964Z
<Venky Shankar> the more pressing point is the condition for >1.0 check
2024-06-12T07:01:53.362Z
<Rishabh Dave> > but just comment it out
okay, will get it done.
2024-06-12T07:03:29.900Z
<Rishabh Dave> > the more pressing point is the condition for >1.0 check
i was about to ask about that too. i think you are referring to this thread, right? <https://github.com/ceph/ceph/pull/54620#discussion_r1631000670>
2024-06-12T07:04:39.910Z
<Rishabh Dave> it's just defensive programming.
2024-06-12T07:04:59.272Z
<Venky Shankar> that's done when there is an unknown bug
2024-06-12T07:06:27.349Z
<Rishabh Dave> okay, i was just being extra defensive but i've decided to remove it since it causes confusion to the reader. i'll push the fix in some time.
2024-06-12T07:10:39.622Z
<Venky Shankar> sure
2024-06-12T07:11:46.212Z
<Rishabh Dave> In the context of <https://github.com/ceph/ceph/pull/54620#discussion_r1634309624>, `get_ceph_shell_stdout()` is very useful for the >1000 LoC of tests that have been added in this PR. Can we keep it? There are similar methods for running Ceph commands (`raw_cluster_cmd()`, `get_ceph_cmd_stdout()`)
2024-06-12T07:12:07.145Z
<Venky Shankar> I'd expect this reply in the PR please
2024-06-12T07:12:22.241Z
<Venky Shankar> I think we already spoke about this?
2024-06-12T07:12:31.119Z
<Venky Shankar> it's a straightforward change
2024-06-12T07:12:51.193Z
<Venky Shankar> but the only reason I'm not inclined to include it in this PR is the code churn
2024-06-12T07:16:02.063Z
<Rishabh Dave> > is the code churn
I agree, which is why I am using this method only in the new tests that are being added in the PR and not anywhere else.
2024-06-12T07:16:38.112Z
<Rishabh Dave> no intention to refactor any existing QA code with it
2024-06-12T07:17:18.335Z
<Venky Shankar> I'll recheck when I review the change again.
2024-06-12T07:17:34.517Z
<Venky Shankar> Seems like something we can keep in this change then.
2024-06-12T07:17:46.563Z
<Venky Shankar> Let's focus on other changes/comments first.
2024-06-12T07:18:10.120Z
<Rishabh Dave> yes, getting rest of changes incorporated
2024-06-12T07:22:20.265Z
<Rishabh Dave> thanks @Venky Shankar!
2024-06-12T09:17:52.681Z
<Neeraj Pratap Singh> @Venky Shankar @Rishabh Dave if you are collecting PRs for teuthology testing, <https://github.com/ceph/ceph/pull/49974> is ready for testing, pls include it.
2024-06-12T14:34:40.206Z
<Jos Collin> @Venky Shankar Please approve <https://github.com/ceph/ceph/pull/56700>.
