ceph - ceph-devel - 2024-07-19

Timestamp (UTC) | Message
2024-07-19T06:34:14.599Z
<Ashutosh Sharma> radosgw-admin role create --role-name=test-role-1 --assume-role-policy-doc=assume-policy-test-role-1.json

// assume role policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam:::user/test-user-1"]},
      "Action": ["sts:AssumeRole"]
    }
  ]
}

radosgw-admin role policy put --role-name=test-role-1 --policy-name=role1Policy --policy-doc="$(cat role-policy-test-role-1.json)"

// role-policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1/*"
      ]
    }
  ]
}

I created the role using this, and the policy was reflected, but it has not been assigned to the user.
I also tried this:
radosgw-admin caps add --uid="test-user-1" --caps="roles=test-role-1"

But it is still not working. The documentation does not specify how to attach a role to a user. Suggestions are appreciated.
2024-07-19T06:39:15.716Z
<Ashutosh Sharma> The owner of test-bucket-1 is another user.
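A minimal sketch of the step that appears to be missing, not an authoritative procedure: in the AWS-style IAM model that RGW implements, a role is not attached to a user; the trust policy above already names test-user-1 as a Principal allowed to call sts:AssumeRole, so the user obtains temporary credentials by assuming the role (see the AssumeRole API link Casey posts later). The endpoint URL and key placeholders here are assumptions, and STS support must be enabled on the gateway.
```# placeholders: test-user-1's S3 keys and the RGW endpoint (STS enabled)
export AWS_ACCESS_KEY_ID=<test-user-1-access-key>
export AWS_SECRET_ACCESS_KEY=<test-user-1-secret-key>
RGW_ENDPOINT=http://rgw-host:8000

# obtain temporary credentials for the role
aws --endpoint-url "$RGW_ENDPOINT" sts assume-role \
    --role-arn "arn:aws:iam:::role/test-role-1" \
    --role-session-name test-session

# export the returned AccessKeyId / SecretAccessKey / SessionToken as
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN, then:
aws --endpoint-url "$RGW_ENDPOINT" s3 ls s3://test-bucket-1```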
2024-07-19T08:10:24.847Z
<Ilya Dryomov> @rzarzynski Hit the following assert in teuthology on yesterday's main with some RBD patches added:
```2024-07-19T02:41:53.779 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: In function 'static void ECCommon::ReadPipeline::get_min_want_to_read_shards(uint64_t, uint64_t, const ECUtil::stripe_info_t&, const std::vector<int>&, std::set<int>*)' thread 7f1caaa98640 time 2024-07-19T02:41:53.781585+0000
2024-07-19T02:41:53.780 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: 336: FAILED ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count())
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: ceph version 19.0.0-5126-g01e1aa05 (01e1aa052c15e2a077a20de402d3ae763de4a30d) squid (dev)
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x118) [0x55a0ce594810]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 2: ceph-osd(+0x4009c7) [0x55a0ce5949c7]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 3: ceph-osd(+0x3a709a) [0x55a0ce53b09a]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 4: (ECCommon::ReadPipeline::get_min_want_to_read_shards(unsigned long, unsigned long, std::set<int, std::less<int>, std::allocator<int> >*)+0x56) [0x55a0ce881da6]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 5: (ECCommon::ReadPipeline::objects_read_and_reconstruct(std::map<hobject_t, std::__cxx11::list<ECCommon::ec_align_t, std::allocator<ECCommon::ec_align_t> >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::__cxx11::list<ECCommon::ec_align_t, std::allocator<ECCommon::ec_align_t> > > > > const&, bool, std::unique_ptr<GenContext<std::map<hobject_t, ECCommon::ec_extent_t, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, ECCommon::ec_extent_t> > >&&>, std::default_delete<GenContext<std::map<hobject_t, ECCommon::ec_extent_t, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, ECCommon::ec_extent_t> > >&&> > >&&)+0x8f7) [0x55a0ce888a17]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 6: (ECBackend::objects_read_async(hobject_t const&, std::__cxx11::list<std::pair<ECCommon::ec_align_t, std::pair<ceph::buffer::v15_2_0::list*, Context*> >, std::allocator<std::pair<ECCommon::ec_align_t, std::pair<ceph::buffer::v15_2_0::list*, Context*> > > > const&, Context*, bool)+0x5a0) [0x55a0cea98d70]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 7: (PrimaryLogPG::OpContext::start_async_reads(PrimaryLogPG*)+0x179) [0x55a0ce7cd669]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 8: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x473) [0x55a0ce7f6153]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30f3) [0x55a0ce7e1b83]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55a0ce730527]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 11: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a0ce973de9]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55a0ce73b0d3]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55a0cec36b53]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 14: ceph-osd(+0xaa30b4) [0x55a0cec370b4]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f1cc9a42b43]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 16: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f1cc9ad4a00]
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:*** Caught signal (Aborted) **
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr: in thread 7f1caaa98640 thread_name:tp_osd_tp
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:2024-07-19T02:41:53.782+0000 7f1caaa98640 -1 ./src/osd/ECCommon.cc: In function 'static void ECCommon::ReadPipeline::get_min_want_to_read_shards(uint64_t, uint64_t, const ECUtil::stripe_info_t&, const std::vector<int>&, std::set<int>*)' thread 7f1caaa98640 time 2024-07-19T02:41:53.781585+0000
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: 336: FAILED ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count())```
2024-07-19T08:12:17.256Z
<Ilya Dryomov> This snippet doesn't include the actual assert -- is there a `... FAILED ceph_assert ...` line somewhere in the output?
2024-07-19T08:15:35.799Z
<Ashutosh Sharma> @Casey Bodley
2024-07-19T09:15:17.501Z
<Armsby> this must be it
```     0> 2024-07-19T08:32:22.036+0000 7f0189839700  5 asok(0x55d2203d8000) unregister_commands rbd mirror restart images/15172b5b-45af-4f52-bed6-47fd57941b36```
2024-07-19T09:18:56.992Z
<Armsby> and it also looks like someone deleted that image, and rbd-mirror fails because of that, but I cannot disable mirroring for it since the image no longer exists
2024-07-19T09:23:58.763Z
<Ilya Dryomov> Can you paste the full crash splat?
2024-07-19T09:36:51.899Z
<system> file crash dump.rtf too big to download (1736438 > allowed size: 1000000)
2024-07-19T09:36:51.900Z
<Armsby> here is the whole crash dump
2024-07-19T09:44:35.751Z
<Ilya Dryomov> Looks like it's failing to create a thread:
```/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7fa16f9cb700 time 2024-07-19T09:27:31.069680+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: 165: FAILED ceph_assert(ret == 0)```
2024-07-19T09:44:54.555Z
<Ilya Dryomov> Perhaps you are running into a resource limit (ulimit)?
2024-07-19T09:47:18.026Z
<Armsby> I thought that too, so I checked; this is the ulimit inside the container
```[root@mon-001 ~]# cephadm enter -n rbd-mirror.mon-001.lcqrti
Inferring fsid 96b20570-3e18-41fa-8fac-3772bc1494a8
[ceph: root@mon-001 /]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 513111
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4194304
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited```
2024-07-19T09:53:37.927Z
<Ilya Dryomov> The container runtime can also impose a limit, see `--pids-limit` and similar options
2024-07-19T09:56:02.474Z
<Ilya Dryomov> What is the output of `cat /sys/fs/cgroup/pids/pids.max` in the container?
2024-07-19T09:56:35.994Z
<Armsby> ``` cat /sys/fs/cgroup/pids/pids.max
2048```
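A quick way to see how close the daemon is to that limit (a rough sketch; the path differs between cgroup v1, as above, and cgroup v2, and `pidof` may need to be installed in the container):
```# cgroup v1
cat /sys/fs/cgroup/pids/pids.max /sys/fs/cgroup/pids/pids.current
# cgroup v2
cat /sys/fs/cgroup/pids.max /sys/fs/cgroup/pids.current
# thread count of the rbd-mirror process inside the container
grep Threads /proc/$(pidof rbd-mirror)/status```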
2024-07-19T09:59:07.150Z
<Armsby> I have just updated it to 4096 to see if that helps
2024-07-19T10:00:17.847Z
<Armsby> it did not
2024-07-19T10:18:56.440Z
<Ilya Dryomov> Have you checked that you are able to create 4096 threads instead of 2048 after the update?
2024-07-19T10:20:26.521Z
<Ilya Dryomov> How many RBD images have mirroring enabled?  Did that number change yesterday?
> it was working fine for weeks but yesterday the rbd-mirror on the old cluster keep crashing with this error
2024-07-19T10:22:09.759Z
<Ilya Dryomov> Also, if you are in the process of migrating images from the old cluster, the fact that the rbd-mirror daemon is crashing there can probably be ignored -- what matters in this case is the rbd-mirror daemon on the new cluster, since it's doing all of the work
2024-07-19T10:31:25.544Z
<Armsby> I have 5087 enabled; I did not add any yesterday, but it looks like someone deleted some in OpenStack
2024-07-19T10:32:21.885Z
<Armsby> the daemon in the new cluster goes into error mode when it is down in the old cluster
2024-07-19T10:33:55.130Z
<Armsby> but doing this `podman update --pids-limit 102400 ceph-96b20570-3e18-41fa-8fac-3772bc1494a8-rbd-mirror-mon-001-lcqrti` seems to keep it up; now all the images are just in state up+stopped
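Note that `podman update` only changes the running container, so a redeploy would recreate it with the old limit. A hedged sketch of making it persistent, assuming a cephadm release that supports `extra_container_args` in service specs (check your version's docs before relying on it):
```# reuse the 102400 value that worked above; adjust placement to your hosts
cat > rbd-mirror-spec.yaml <<'EOF'
service_type: rbd-mirror
placement:
  hosts:
    - mon-001
extra_container_args:
  - "--pids-limit=102400"
EOF
ceph orch apply -i rbd-mirror-spec.yaml```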
2024-07-19T10:34:15.291Z
<Armsby> it might have been out of threads
2024-07-19T11:07:00.627Z
<rzarzynski> @Ilya Dryomov: it's recently changed code. Do you have a link to the teuthology job?
2024-07-19T11:57:22.971Z
<Ilya Dryomov> <https://pulpito.ceph.com/dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7808087>
2024-07-19T11:57:49.708Z
<Ilya Dryomov> You should be able to trigger it by running the `TestLibRBD.TestEncryptionLUKS1` test
2024-07-19T12:00:35.606Z
<Ilya Dryomov> ... with `rbd default data pool = $EC_POOL_NAME` in ceph.conf (with an EC pool pre-created)
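Roughly, that repro boils down to something like the following sketch (pool name is arbitrary; assumes `ceph_test_librbd` is built and a client keyring is available):
```# create an EC pool usable as an RBD data pool
ceph osd pool create rbd-data-ec erasure
ceph osd pool set rbd-data-ec allow_ec_overwrites true
ceph osd pool application enable rbd-data-ec rbd

# point librbd at it
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd default data pool = rbd-data-ec
EOF

# run just the failing test
ceph_test_librbd --gtest_filter=TestLibRBD.TestEncryptionLUKS1```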
2024-07-19T12:37:04.580Z
<Casey Bodley> <https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html>
2024-07-19T13:00:53.334Z
<rzarzynski> thanks!
2024-07-19T13:04:42.993Z
<Casey Bodley> i'm trying to help a user test a fix in <https://tracker.ceph.com/issues/66937#note-13>. any idea why they can't access containers built by ceph-ci?
> ```docker pull quay.io/ceph-ci/ceph:wip-66937-squid
> Error response from daemon: unauthorized: access to the requested resource is not authorized```
2024-07-19T14:01:18.693Z
<Casey Bodley> is that domain name supposed to be `quay.ceph.io` perhaps?
2024-07-19T14:22:00.950Z
<John Mulligan> yeah, I think that is correct. it needs to be `quay.ceph.io`
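i.e., something like this (tag taken from the earlier paste):
```docker pull quay.ceph.io/ceph-ci/ceph:wip-66937-squid```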
2024-07-19T14:22:46.391Z
<Casey Bodley> thanks John. i opened <https://github.com/ceph/ceph/pull/58678> to fix the cephadm doc there
2024-07-19T14:23:31.166Z
<John Mulligan> I approved it
2024-07-19T16:08:30.042Z
<Rost Khudov> Hello, I have a question.
Since the minimum supported Python version for Ceph on the main branch is 3.9 ([CMakeLists.txt](https://github.com/ceph/ceph/blob/main/CMakeLists.txt#L584)), should we change the default `python3_pkgversion`, `python3_version_nodots` and `python3_version` in the `ceph.spec.in` file from 3 to 3.9 as well?

When I try to do an rpmbuild, it searches for python3-devel instead of python3.9-devel.
2024-07-19T16:19:57.385Z
<Casey Bodley> i think `python3-devel` is what we want, so we can use the distro's default python version. most are past 3.9 by now. what distro/version are you trying to build on?
2024-07-19T16:20:45.050Z
<Rost Khudov> rhel8
2024-07-19T16:20:53.098Z
<Rost Khudov> and for it the default version is 3.6
2024-07-19T16:22:22.081Z
<Casey Bodley> we've dropped official support for centos 8 and rhel 8 for the squid/main branches. i'd suggest using a newer distro
2024-07-19T16:23:06.957Z
<Rost Khudov> oh, it was in some PR? looks like I missed it
2024-07-19T16:23:07.674Z
<Casey Bodley> you're welcome to hack `ceph.spec.in` yourself to work around that but i don't think we want to change how that works on main
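One way to do that workaround without patching the spec permanently is to override the macros named in the question at build time; a sketch, using values that match the python3.9-devel naming Rost expects (adjust to whatever the builder actually provides):
```rpmbuild -ba ceph.spec \
  --define 'python3_pkgversion 3.9' \
  --define 'python3_version 3.9' \
  --define 'python3_version_nodots 39'```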
2024-07-19T17:24:11.347Z
<Joseph Mundackal> RHEL 8 is not EOL - so why did we explicitly drop support for it?
2024-07-19T17:25:56.656Z
<Casey Bodley> we didn't ever explicitly support rhel. centos has been our proxy for rhel support. we were planning to drop centos 8 for squid regardless of eol
2024-07-19T17:27:33.287Z
<Joseph Mundackal> ah i see - so rhel 8 got dropped by virtue of centos 8 going away
