ceph - ceph-devel - 2024-07-19

Timestamp (UTC) | Message
2024-07-19T06:34:14.599Z
<Ashutosh Sharma> radosgw-admin role create --role-name=test-role-1 --assume-role-policy-doc=assume-policy-test-role-1.json

// assume role policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam:::user/test-user-1"]},
      "Action": ["sts:AssumeRole"]
    }
  ]
}

radosgw-admin role policy put --role-name=test-role-1 --policy-name=role1Policy --policy-doc="$(cat role-policy-test-role-1.json)"

// role-policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::test-bucket-1/*"
      ]
    }
  ]
}

I created the role using this, and the policy was reflected, but it has not been assigned to the user.
I also tried this:
radosgw-admin caps add --uid="test-user-1" --caps="roles=test-role-1"

But it is still not working. The documentation does not specify how to attach a role to a user. Suggestions are appreciated.
2024-07-19T06:39:15.716Z
<Ashutosh Sharma> The owner of test-bucket-1 is another user.
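A minimal sketch of the step that appears to be missing, not an authoritative procedure: in the AWS-style IAM model that RGW implements, a role is not attached to a user; the trust policy above already names test-user-1 as a Principal allowed to call sts:AssumeRole, so the user obtains temporary credentials by assuming the role (see the AssumeRole API link Casey posts later). The endpoint URL and key placeholders here are assumptions, and STS support must be enabled on the gateway.
```# placeholders: test-user-1's S3 keys and the RGW endpoint (STS enabled)
export AWS_ACCESS_KEY_ID=<test-user-1-access-key>
export AWS_SECRET_ACCESS_KEY=<test-user-1-secret-key>
RGW_ENDPOINT=http://rgw-host:8000

# obtain temporary credentials for the role
aws --endpoint-url "$RGW_ENDPOINT" sts assume-role \
    --role-arn "arn:aws:iam:::role/test-role-1" \
    --role-session-name test-session

# export the returned AccessKeyId / SecretAccessKey / SessionToken as
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN, then:
aws --endpoint-url "$RGW_ENDPOINT" s3 ls s3://test-bucket-1```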
2024-07-19T08:10:24.847Z
<Ilya Dryomov> @rzarzynski Hit the following assert in teuthology on yesterday's main with some RBD patches added:
```2024-07-19T02:41:53.779 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: In function 'static void ECCommon::ReadPipeline::get_min_want_to_read_shards(uint64_t, uint64_t, const ECUtil::stripe_info_t&, const std::vector<int>&, std::set<int>*)' thread 7f1caaa98640 time 2024-07-19T02:41:53.781585+0000
2024-07-19T02:41:53.780 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: 336: FAILED ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count())
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: ceph version 19.0.0-5126-g01e1aa05 (01e1aa052c15e2a077a20de402d3ae763de4a30d) squid (dev)
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x118) [0x55a0ce594810]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 2: ceph-osd(+0x4009c7) [0x55a0ce5949c7]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 3: ceph-osd(+0x3a709a) [0x55a0ce53b09a]
2024-07-19T02:41:53.786 INFO:tasks.ceph.osd.0.smithi037.stderr: 4: (ECCommon::ReadPipeline::get_min_want_to_read_shards(unsigned long, unsigned long, std::set<int, std::less<int>, std::allocator<int> >*)+0x56) [0x55a0ce881da6]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 5: (ECCommon::ReadPipeline::objects_read_and_reconstruct(std::map<hobject_t, std::__cxx11::list<ECCommon::ec_align_t, std::allocator<ECCommon::ec_align_t> >, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::__cxx11::list<ECCommon::ec_align_t, std::allocator<ECCommon::ec_align_t> > > > > const&, bool, std::unique_ptr<GenContext<std::map<hobject_t, ECCommon::ec_extent_t, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, ECCommon::ec_extent_t> > >&&>, std::default_delete<GenContext<std::map<hobject_t, ECCommon::ec_extent_t, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, ECCommon::ec_extent_t> > >&&> > >&&)+0x8f7) [0x55a0ce888a17]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 6: (ECBackend::objects_read_async(hobject_t const&, std::__cxx11::list<std::pair<ECCommon::ec_align_t, std::pair<ceph::buffer::v15_2_0::list*, Context*> >, std::allocator<std::pair<ECCommon::ec_align_t, std::pair<ceph::buffer::v15_2_0::list*, Context*> > > > const&, Context*, bool)+0x5a0) [0x55a0cea98d70]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 7: (PrimaryLogPG::OpContext::start_async_reads(PrimaryLogPG*)+0x179) [0x55a0ce7cd669]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 8: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x473) [0x55a0ce7f6153]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30f3) [0x55a0ce7e1b83]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55a0ce730527]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 11: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a0ce973de9]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55a0ce73b0d3]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55a0cec36b53]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 14: ceph-osd(+0xaa30b4) [0x55a0cec370b4]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f1cc9a42b43]
2024-07-19T02:41:53.787 INFO:tasks.ceph.osd.0.smithi037.stderr: 16: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f1cc9ad4a00]
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:*** Caught signal (Aborted) **
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr: in thread 7f1caaa98640 thread_name:tp_osd_tp
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:2024-07-19T02:41:53.782+0000 7f1caaa98640 -1 ./src/osd/ECCommon.cc: In function 'static void ECCommon::ReadPipeline::get_min_want_to_read_shards(uint64_t, uint64_t, const ECUtil::stripe_info_t&, const std::vector<int>&, std::set<int>*)' thread 7f1caaa98640 time 2024-07-19T02:41:53.781585+0000
2024-07-19T02:41:53.788 INFO:tasks.ceph.osd.0.smithi037.stderr:./src/osd/ECCommon.cc: 336: FAILED ceph_assert(want_to_read->size() == sinfo.get_data_chunk_count())```
2024-07-19T08:12:17.256Z
<Ilya Dryomov> This snippet doesn't include the actual assert -- is there a `... FAILED ceph_assert ...` line somewhere in the output?
2024-07-19T08:15:35.799Z
<Ashutosh Sharma> @Casey Bodley
2024-07-19T09:15:17.501Z
<Armsby> this must be it
```     0> 2024-07-19T08:32:22.036+0000 7f0189839700  5 asok(0x55d2203d8000) unregister_commands rbd mirror restart images/15172b5b-45af-4f52-bed6-47fd57941b36```
2024-07-19T09:18:56.992Z
<Armsby> and it also looks like someone deleted that image, and rbd-mirror fails because of that, but I cannot disable mirroring for it since the image no longer exists
2024-07-19T09:23:58.763Z
<Ilya Dryomov> Can you paste the full crash splat?
2024-07-19T09:36:51.899Z
<system> file crash dump.rtf too big to download (1736438 > allowed size: 1000000)
2024-07-19T09:36:51.900Z
<Armsby> here is the whole crash dump
2024-07-19T09:44:35.751Z
<Ilya Dryomov> Looks like it's failing to create a thread:
```/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7fa16f9cb700 time 2024-07-19T09:27:31.069680+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: 165: FAILED ceph_assert(ret == 0)```
2024-07-19T09:44:54.555Z
<Ilya Dryomov> Perhaps you are running into a resource limit (ulimit)?
2024-07-19T09:47:18.026Z
<Armsby> I thought that too, so I checked; this is the ulimit inside the container
```[root@mon-001 ~]# cephadm enter -n rbd-mirror.mon-001.lcqrti
Inferring fsid 96b20570-3e18-41fa-8fac-3772bc1494a8
[ceph: root@mon-001 /]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 513111
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4194304
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited```
2024-07-19T09:53:37.927Z
<Ilya Dryomov> The container runtime can also impose a limit, see `--pids-limit` and similar options
2024-07-19T09:56:02.474Z
<Ilya Dryomov> What is the output of `cat /sys/fs/cgroup/pids/pids.max` in the container?
2024-07-19T09:56:35.994Z
<Armsby> ``` cat /sys/fs/cgroup/pids/pids.max
2048```
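A quick way to see how close the daemon is to that limit (a rough sketch; the path differs between cgroup v1, as above, and cgroup v2, and `pidof` may need to be installed in the container):
```# cgroup v1
cat /sys/fs/cgroup/pids/pids.max /sys/fs/cgroup/pids/pids.current
# cgroup v2
cat /sys/fs/cgroup/pids.max /sys/fs/cgroup/pids.current
# thread count of the rbd-mirror process inside the container
grep Threads /proc/$(pidof rbd-mirror)/status```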
2024-07-19T09:59:07.150Z
<Armsby> I have just updated it to 4096 to see if that helps
2024-07-19T10:00:17.847Z
<Armsby> it did not
2024-07-19T10:18:56.440Z
<Ilya Dryomov> Have you checked that you are able to create 4096 threads instead of 2048 after the update?
2024-07-19T10:20:26.521Z
<Ilya Dryomov> How many RBD images have mirroring enabled?  Did that number change yesterday?
> it was working fine for weeks but yesterday the rbd-mirror on the old cluster keep crashing with this error
2024-07-19T10:22:09.759Z
<Ilya Dryomov> Also, if you are in the process of migrating images from the old cluster, the fact that the rbd-mirror daemon is crashing there can probably be ignored -- what matters in this case is the rbd-mirror daemon on the new cluster, since it's doing all of the work
2024-07-19T10:31:25.544Z
<Armsby> I have 5087 enabled; I did not add any yesterday, but it looks like someone deleted some in OpenStack
2024-07-19T10:32:21.885Z
<Armsby> the daemon in the new cluster goes into error mode when it is down in the old cluster
2024-07-19T10:33:55.130Z
<Armsby> but doing this `podman update --pids-limit 102400 ceph-96b20570-3e18-41fa-8fac-3772bc1494a8-rbd-mirror-mon-001-lcqrti` seems to keep it up; now all the images are just in state up+stopped
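Note that `podman update` only changes the running container, so a redeploy would recreate it with the old limit. A hedged sketch of making it persistent, assuming a cephadm release that supports `extra_container_args` in service specs (check your version's docs before relying on it):
```# reuse the 102400 value that worked above; adjust placement to your hosts
cat > rbd-mirror-spec.yaml <<'EOF'
service_type: rbd-mirror
placement:
  hosts:
    - mon-001
extra_container_args:
  - "--pids-limit=102400"
EOF
ceph orch apply -i rbd-mirror-spec.yaml```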
2024-07-19T10:34:15.291Z
<Armsby> it might have been out of threads
2024-07-19T11:07:00.627Z
<rzarzynski> @Ilya Dryomov: it's recently changed code. Do you have a link to the teuthology job?
2024-07-19T11:57:22.971Z
<Ilya Dryomov> <https://pulpito.ceph.com/dis-2024-07-18_22:11:30-rbd-wip-dis-testing-distro-default-smithi/7808087>
2024-07-19T11:57:49.708Z
<Ilya Dryomov> You should be able to trigger it by running the `TestLibRBD.TestEncryptionLUKS1` test
2024-07-19T12:00:35.606Z
<Ilya Dryomov> ... with `rbd default data pool = $EC_POOL_NAME` in ceph.conf (with an EC pool pre-created)
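Roughly, that repro boils down to something like the following sketch (pool name is arbitrary; assumes `ceph_test_librbd` is built and a client keyring is available):
```# create an EC pool usable as an RBD data pool
ceph osd pool create rbd-data-ec erasure
ceph osd pool set rbd-data-ec allow_ec_overwrites true
ceph osd pool application enable rbd-data-ec rbd

# point librbd at it
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd default data pool = rbd-data-ec
EOF

# run just the failing test
ceph_test_librbd --gtest_filter=TestLibRBD.TestEncryptionLUKS1```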
2024-07-19T12:37:04.580Z
<Casey Bodley> <https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html>
2024-07-19T13:00:53.334Z
<rzarzynski> thanks!
2024-07-19T13:04:42.993Z
<Casey Bodley> i'm trying to help a user test a fix in <https://tracker.ceph.com/issues/66937#note-13>. any idea why they can't access containers built by ceph-ci?
> ```docker pull quay.io/ceph-ci/ceph:wip-66937-squid
> Error response from daemon: unauthorized: access to the requested resource is not authorized```
2024-07-19T14:01:18.693Z
<Casey Bodley> is that domain name supposed to be `quay.ceph.io` perhaps?
2024-07-19T14:22:00.950Z
<John Mulligan> yeah, I think that is correct. it needs to be `quay.ceph.io`
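i.e., something like this (tag taken from the earlier paste):
```docker pull quay.ceph.io/ceph-ci/ceph:wip-66937-squid```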
2024-07-19T14:22:46.391Z
<Casey Bodley> thanks John. i opened <https://github.com/ceph/ceph/pull/58678> to fix the cephadm doc there
2024-07-19T14:23:31.166Z
<John Mulligan> I approved it
2024-07-19T16:08:30.042Z
<Rost Khudov> Hello, I have a question.
Since the minimum supported Python version for Ceph on the main branch is 3.9 ([CMakeLists.txt](https://github.com/ceph/ceph/blob/main/CMakeLists.txt#L584)), should we change the default `python3_pkgversion`, `python3_version_nodots` and `python3_version` in the `ceph.spec.in` file from 3 to 3.9 as well?

When I try to do an rpmbuild, it searches for python3-devel instead of python3.9-devel.
2024-07-19T16:19:57.385Z
<Casey Bodley> i think `python3-devel` is what we want, so we can use the distro's default python version. most are past 3.9 by now. what distro/version are you trying to build on?
2024-07-19T16:20:45.050Z
<Rost Khudov> rhel8
2024-07-19T16:20:53.098Z
<Rost Khudov> and for it the default version is 3.6
2024-07-19T16:22:22.081Z
<Casey Bodley> we've dropped official support for centos 8 and rhel 8 for the squid/main branches. i'd suggest using a newer distro
2024-07-19T16:23:06.957Z
<Rost Khudov> oh, it was in some PR? looks like I missed it
2024-07-19T16:23:07.674Z
<Casey Bodley> you're welcome to hack `ceph.spec.in` yourself to work around that but i don't think we want to change how that works on main
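One way to do that workaround without patching the spec permanently is to override the macros named in the question at build time; a sketch, using values that match the python3.9-devel naming Rost expects (adjust to whatever the builder actually provides):
```rpmbuild -ba ceph.spec \
  --define 'python3_pkgversion 3.9' \
  --define 'python3_version 3.9' \
  --define 'python3_version_nodots 39'```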
2024-07-19T17:24:11.347Z
<Joseph Mundackal> RHEL 8 is not EOL - so why did we explicitly drop support for it?
2024-07-19T17:25:56.656Z
<Casey Bodley> we didn't ever explicitly support rhel. centos has been our proxy for rhel support. we were planning to drop centos 8 for squid regardless of eol
2024-07-19T17:27:33.287Z
<Joseph Mundackal> ah i see - so rhel 8 got dropped by virtue of centos 8 going away
