The plan is to use 10.0.4.0/24 for the public network and 10.0.5.0/24 for the cluster network.
DNS entries were updated as follows:

10.0.4.1 ceph1
10.0.4.2 ceph2
10.0.4.3 ceph3
10.0.4.4 ceph4
10.0.5.1 csync1
10.0.5.2 csync2
10.0.5.3 csync3
10.0.5.4 csync4
As part of planning and preparing the servers, we enabled passwordless, key-based SSH access between them, a prerequisite for Ceph installation.
Add the Ceph repository

wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-add-repository 'deb https://download.ceph.com/debian-squid/ bookworm main'
apt -y update
apt -y upgrade
Install Ceph on all servers, remove cephadm (we are not using it), and install numactl for CPU pinning of the OSD services.
apt -y install ceph python3-packaging numactl libhugetlbfs-bin libhugetlbfs0
apt remove --purge cephadm
apt reinstall python3-cryptography
apt reinstall ceph-mgr
Log in (SSH) to node ceph1 (the first node)
Generate a unique UUID for the cluster fsid.

uuidgen
b115cfad-cce9-4404-a9eb-e821e856bbfd
Create a ceph configuration file /etc/ceph/ceph.conf (with only one monitor node to start with)
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1
mon initial members = ceph1

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 1024

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=1048576
osd_memory_target = 8G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 12
bluestore_cache_size = 4294967296 # 4 GiB
bluestore_cache_max_dirty = 2147483648
bluestore_cache_deferred_read = true
Notes
We have enabled huge pages support. Adding "bluestore_rocksdb_options = memtable_huge_page_size=1048576" lets the OSD's RocksDB memtables use them, improving performance.
We have planned to reserve 10 G per OSD (via systemd configuration), which should be sufficient to cover the 8 G OSD memory target.
When 25% of the OSDs are down, go into read-only mode to ensure data consistency (osd_min_in_ratio = 0.75).
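The second note can be sanity-checked with a line of arithmetic: the per-OSD systemd reservation (10 G, set later via MemoryMax) leaves headroom above the 8 G osd_memory_target, which Ceph treats as a best-effort target rather than a hard cap.

```shell
# Headroom between the per-OSD systemd reservation and osd_memory_target.
# Values mirror this document's configuration (10 G reservation, 8 G target).
osd_memory_target_gb=8
systemd_memorymax_gb=10
headroom_gb=$(( systemd_memorymax_gb - osd_memory_target_gb ))
echo "headroom: ${headroom_gb}G"   # prints "headroom: 2G"
```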
Create a keyring for the cluster and generate a monitor secret key.
ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
Generate an administrator keyring, generate a client.admin user, and add the user to the keyring
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
Generate a bootstrap-osd keyring, generate a client.bootstrap-osd user, and add the user to the keyring
ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
Import generated keys
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
Generate a monitor map
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --create --add ceph1 10.0.4.1 --fsid $FSID /etc/ceph/monmap

Copy the generated map and configuration files to the other nodes
scp /etc/ceph/* ceph2:/etc/ceph/
scp /etc/ceph/* ceph3:/etc/ceph/
scp /etc/ceph/* ceph4:/etc/ceph/
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph2:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph3:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph4:/var/lib/ceph/bootstrap-osd
ssh ceph2 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph3 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph4 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
Configure and enable the monitor daemon
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph1
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
Create ceph volumes on all servers
Note: Carving out separate partitions for block.wal and block.db (away from the data block) reduces the latency of write operations.
There are no specific references for calculating the size of the WAL and DB partitions; on a simple ad-hoc basis we used 8 G WAL / 32 G block DB per 1 TB of raw storage.
Used the following snippet to create Ceph volumes on NVMe disks (replace 'x' with the NVMe number).
Note: passing --osd-id during ceph-volume lvm create does not help - TODO: update the script to handle this.
The sleep calls are not strictly required, but they are included to avoid any timing issues.
#!/bin/bash
export devs="nvme0,nvme1,nvme2"
export osds="0,1,2"

# Split into arrays
IFS=',' read -ra dev_arr <<< "$devs"
IFS=',' read -ra osd_arr <<< "$osds"

for counter in "${!dev_arr[@]}"; do
  OSD="${osd_arr[$counter]}"
  DEVNAME="${dev_arr[$counter]}"
  SERVICE_NAME="ceph-osd@${OSD}.service"
  echo "============================================================================"
  echo "Device : $DEVNAME. Stopping ${SERVICE_NAME}"
  systemctl stop "${SERVICE_NAME}"
  sleep 2

  # Remove OSD
  echo " "
  echo "Executing : ceph osd down $OSD"
  ceph osd down ${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd out $OSD"
  ceph osd out ${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd crush remove osd.$OSD"
  ceph osd crush remove osd.${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph auth del osd.$OSD"
  ceph auth del osd.${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd rm $OSD"
  ceph osd rm ${OSD}
  sleep 2

  # Identify WAL size / Block DB size - 8G / 32G per TB respectively
  sectors=$(cat /sys/block/${DEVNAME}n1/size)
  bytes=$(( sectors * 512 ))
  tb_bytes=$(( 1000 * 1000 * 1000 * 1000 ))
  tb_ceil=$(( bytes / tb_bytes ))
  # In case of smaller storage media keep min calculations against 1 TB
  if [[ $tb_ceil -lt 1 ]]; then
    tb_ceil=1
  fi
  wal_gb=$(( tb_ceil * 8 ))
  data_gb=$(( tb_ceil * 32 ))
  echo "Device : $DEVNAME, OSD : $OSD, Bytes : $bytes, WAL : $wal_gb, BLOCKDATA : $data_gb"

  # Zap ceph volumes if present
  # Assuming only three partitions were created earlier
  for (( i=1; i <= 3; i++ )); do
    partition="/dev/${DEVNAME}n1p${i}"
    if [ -b "$partition" ]; then
      sleep 3
      echo " "
      echo "Executing ceph-volume lvm zap $partition"
      ceph-volume lvm zap $partition
      echo " "
      echo "Executing wipefs --all --force --quiet ${partition}"
      wipefs --all --force --quiet ${partition}
      sleep 2
    else
      echo "Partition $partition does not exist."
    fi
  done

  # ZAP reports device in use - work around
  lsblk /dev/${DEVNAME}n1 | grep lvm | sed 's/^....//' | cut -d " " -f1 | grep "^ceph" > dmsetup.txt
  while IFS= read -r line; do
    echo "Executing dmsetup remove $line"
    dmsetup remove "$line"
  done < dmsetup.txt
  rm -f dmsetup.txt
  sleep 2

  # Extra steps to clear FS
  echo " "
  echo "Executing dd if=/dev/zero of=/dev/${DEVNAME}n1 bs=1M count=10"
  dd if=/dev/zero of=/dev/${DEVNAME}n1 bs=1M count=10
  sleep 3
  echo " "
  echo "Executing partprobe /dev/${DEVNAME}n1"
  partprobe /dev/${DEVNAME}n1

  # Create partitions for ceph
  sleep 3
  echo " "
  echo "Creating partitions - Device ${DEVNAME}n1"
  parted /dev/${DEVNAME}n1 --script \
    mklabel gpt \
    unit GB \
    mkpart primary 1MiB "${wal_gb}GB" \
    mkpart primary "${wal_gb}GB" "$((wal_gb + data_gb))GB" \
    mkpart primary "$((wal_gb + data_gb))GB" 100%

  echo " "
  echo "Creating ceph volumes..."
  # Create the ceph volume
  sleep 3
  echo " "
  echo "Executing ceph-volume lvm --osd-id ${OSD} create --bluestore --data /dev/${DEVNAME}n1p3 --block.db /dev/${DEVNAME}n1p2 --block.wal /dev/${DEVNAME}n1p1 --crush-device-class nvme"
  ceph-volume lvm --osd-id ${OSD} create --bluestore --data /dev/${DEVNAME}n1p3 --block.db /dev/${DEVNAME}n1p2 --block.wal /dev/${DEVNAME}n1p1 --crush-device-class nvme

  # Start OSD service - may not be required
  sleep 3
  echo " "
  echo "Executing systemctl start ceph-osd@$OSD.service"
  systemctl start ceph-osd@${OSD}.service
  sleep 3
  echo " "
  echo "Executing ceph osd in $OSD"
  ceph osd in ${OSD}
done
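The WAL/DB sizing arithmetic embedded in the script above can be pulled out into a standalone helper for review. A minimal sketch (the function name is mine; it performs the same integer math as the script: floor to whole TB, clamp small devices to 1 TB, then 8 G WAL and 32 G DB per TB):

```shell
# Hypothetical helper replicating the script's ad-hoc sizing rule:
# 8 G WAL and 32 G block DB per TB of raw capacity (minimum 1 TB).
wal_db_sizes() {
  local bytes=$1
  local tb=$(( bytes / (1000 * 1000 * 1000 * 1000) ))
  (( tb < 1 )) && tb=1                  # clamp small devices to 1 TB
  echo "$(( tb * 8 )) $(( tb * 32 ))"   # "<wal_gb> <db_gb>"
}

wal_db_sizes 1920383410176   # 1.92 TB NVMe -> "8 32"
wal_db_sizes 3840755982336   # 3.84 TB NVMe -> "24 96"
```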
Used the same snippet, with the following changes, to create Ceph volumes on SSD disks

#!/bin/bash
export devs="cephdisk3,cephdisk4,cephdisk5,cephdisk6"
export osds="3,4,5,6"
After creating all ceph volumes, check the status
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 1 daemons, quorum ceph1 (age 35m)
    mgr: no daemons active
    osd: 18 osds: 18 up (since 22s), 18 in (since 34s)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
root@server1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 27.75752 root default
-3 6.46405 host server1
0 nvme 1.80479 osd.0 up 1.00000 1.00000
1 nvme 0.90239 osd.1 up 1.00000 1.00000
2 nvme 0.90239 osd.2 up 1.00000 1.00000
3 ssd 0.86600 osd.3 up 1.00000 1.00000
4 ssd 0.86600 osd.4 up 1.00000 1.00000
5 ssd 0.90239 osd.5 up 1.00000 1.00000
6 ssd 0.22009 osd.6 up 1.00000 1.00000
-7 7.07356 host server2
7 nvme 1.80479 osd.7 up 1.00000 1.00000
8 nvme 0.90239 osd.8 up 1.00000 1.00000
9 nvme 0.90239 osd.9 up 1.00000 1.00000
10 ssd 0.86600 osd.10 up 1.00000 1.00000
11 ssd 0.86600 osd.11 up 1.00000 1.00000
12 ssd 0.86600 osd.12 up 1.00000 1.00000
13 ssd 0.86600 osd.13 up 1.00000 1.00000
-10 7.14635 host server3
14 nvme 1.80479 osd.14 up 1.00000 1.00000
15 nvme 0.90239 osd.15 up 1.00000 1.00000
16 nvme 0.90239 osd.16 up 1.00000 1.00000
17 ssd 0.86600 osd.17 up 1.00000 1.00000
18 ssd 0.90239 osd.18 up 1.00000 1.00000
19 ssd 0.90239 osd.19 up 1.00000 1.00000
20 ssd 0.86600 osd.20 up 1.00000 1.00000
-13 7.07356 host server4
21 nvme 1.80479 osd.21 up 1.00000 1.00000
22 nvme 0.90239 osd.22 up 1.00000 1.00000
23 nvme 0.90239 osd.23 up 1.00000 1.00000
24 ssd 0.86600 osd.24 up 1.00000 1.00000
25 ssd 0.86600 osd.25 up 1.00000 1.00000
26 ssd 0.86600 osd.26 up 1.00000 1.00000
27 ssd 0.86600 osd.27 up 1.00000 1.00000
root@server1:~#
Configure monitor services in ceph2 and ceph3 to have three monitor nodes.
Log in to ceph1 and update the monitor map to include ceph2 and ceph3 as monitor nodes.
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --add ceph2 10.0.4.2 --fsid $FSID /etc/ceph/monmap
monmaptool --add ceph3 10.0.4.3 --fsid $FSID /etc/ceph/monmap
Update the ceph configuration file to reflect the new monitor nodes.
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1,10.0.4.2,10.0.4.3
mon initial members = ceph1,ceph2,ceph3

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[mon.ceph2]
host = ceph2
mon addr = 10.0.4.2

[mon.ceph3]
host = ceph3
mon addr = 10.0.4.3

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=1048576
osd_memory_target = 8G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 12
bluestore_cache_size = 4294967296 # 4 GiB
bluestore_cache_max_dirty = 2147483648
bluestore_cache_deferred_read = true
Copy the generated map and configuration files to other nodes
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph2:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph3:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph4:/etc/ceph/
ssh ceph2 "chown ceph:ceph -R /etc/ceph"
ssh ceph3 "chown ceph:ceph -R /etc/ceph"
ssh ceph4 "chown ceph:ceph -R /etc/ceph"
On each of the new monitor nodes (ceph2 and ceph3)
- Log in (SSH) into the node
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph2   # use ceph3 on the third node
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
On each of the manager nodes (ceph1, ceph2, ceph3, ceph4)
- Log in (SSH) into the node
- Create a default data directory on the manager host
- Create an authentication key for the manager daemon
- Enable the daemon to start on host startup
NODENAME=ceph1
mkdir /var/lib/ceph/mgr/ceph-$NODENAME
ceph auth get-or-create mgr.$NODENAME mon 'allow profile mgr' osd 'allow *' mds 'allow *'
ceph auth get-or-create mgr.$NODENAME | tee /etc/ceph/ceph.mgr.admin.keyring
cp /etc/ceph/ceph.mgr.admin.keyring /var/lib/ceph/mgr/ceph-$NODENAME/keyring
chown ceph:ceph /etc/ceph/ceph.mgr.admin.keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-$NODENAME
systemctl enable --now ceph-mgr@$NODENAME
At this stage, the OSD daemons, three monitor daemons, and four manager daemons are up
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 5m)
    mgr: ceph1(active, since 4m), standbys: ceph2, ceph3, ceph4
    osd: 18 osds: 18 up (since 31m), 18 in (since 31m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 961 KiB
    usage:   1.4 GiB used, 20 TiB / 20 TiB avail
    pgs:     1 active+clean

root@server1:~#
Setting CPU and RAM limits for OSD daemons (Repeat in all servers)
How much to reserve is not known up front; start with values based on availability/affordability and tune later.
Reserving resources for OSD services
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
Create a file /etc/systemd/system/ceph-osd@.service.d/osd.conf with the following contents
[Service]
# cgroup limits
MemoryMax=10240M
#CPUAccounting=true
CPUQuota=400%

# Clear the upstream ExecStart, then launch with the hugepage preload
ExecStart=
ExecStart=/usr/bin/env LD_PRELOAD=/lib/x86_64-linux-gnu/libhugetlbfs.so.0 HUGETLB_MORECORE=yes /usr/bin/ceph-osd --id %i --foreground
Reload and restart the Ceph OSD services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-osd@*'
Reserving resources for monitoring services
sudo mkdir -p /etc/systemd/system/ceph-mon@.service.d
Create a file /etc/systemd/system/ceph-mon@.service.d/monitor.conf with the following contents
[Service]
MemoryMax=8192M
#CPUAccounting=true
CPUQuota=400%
Reload and restart the ceph-mon services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-mon@*'
Setting CPU and RAM limits for Manager daemons
sudo mkdir -p /etc/systemd/system/ceph-mgr@.service.d
Create a file /etc/systemd/system/ceph-mgr@.service.d/monitor.conf with the following contents
[Service]
MemoryMax=2048M
#CPUAccounting=true
CPUQuota=200%
Reload and restart the ceph-mgr services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-mgr@*'
Create crush rules for logically grouping NVME and SSD OSDs
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd crush rule create-replicated ssd_rule default host ssd
Configure .mgr pool to use NVME storage
ceph osd pool set .mgr crush_rule nvme_rule
Calculate the placement group count (pg_num) for the NVMe and SSD storage pools. I plan to create the following five pools on NVMe storage, with replication set to 3.
For RBD
nvmepool
For Rados GW (S3 storage for snapshotting etcd when using RKE2 and other S3 needs)
rgw.buckets.data
rgw.buckets.index
rgw.buckets.log
rgw.control
I have 9 NVMe-based OSDs, so pg_num = (9 x 100) / (3 x 5) = 60, rounded off to the nearest power of 2, 64 [3 is the replication factor, 5 is the pool count].
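The same arithmetic can be captured in a small shell function for reuse with other pool counts. A sketch (the function name and the nearest-power-of-two rounding are my additions; the formula is the usual OSDs x 100 / (replicas x pools) rule used above):

```shell
# pg_num = (OSDs * 100) / (replicas * pools), rounded to the nearest
# power of two. Function name is hypothetical.
pg_num() {
  local osds=$1 replicas=$2 pools=$3
  local raw=$(( osds * 100 / (replicas * pools) ))
  local p=1
  while (( p < raw )); do p=$(( p * 2 )); done   # smallest power of two >= raw
  # pick whichever of p and p/2 is closer to raw
  if (( p - raw < raw - p / 2 )); then echo "$p"; else echo $(( p / 2 )); fi
}

pg_num 9 3 5    # 9 NVMe OSDs, 5 pools  -> 64
pg_num 16 3 1   # 16 SSD OSDs, 1 pool   -> 512
```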
Create OSD pools associated with NVME.
# 1) RBD
ceph osd pool create nvmepool 64 64 replicated nvme_rule --size=3
ceph osd pool application enable nvmepool rbd
rbd pool init nvmepool
ceph osd pool set nvmepool pg_autoscale_mode off
ceph osd pool set nvmepool pg_num 64
ceph osd pool set nvmepool pg_num_min 64
ceph osd pool set nvmepool pg_num_max 64

# 2) buckets data (objects)
ceph osd pool create rgw.buckets.data 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.data rgw

# 3) buckets index (metadata)
ceph osd pool create rgw.buckets.index 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.index rgw

# 4) buckets log (optional logging)
ceph osd pool create rgw.buckets.log 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.log rgw

# 5) control (internal RGW control messages)
ceph osd pool create rgw.control 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.control rgw
# Wait until all placement groups have been created
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
Create OSD pools associated with SSD
I have 16 SSD-based OSDs, so pg_num = (16 x 100) / (3 x 1) = 533, rounded off to the nearest power of 2, 512 [3 is the replication factor, 1 is the pool count].

ceph osd pool create ssdpool 512 512 replicated ssd_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable ssdpool rbd
rbd pool init ssdpool
ceph osd pool set ssdpool pg_autoscale_mode off
ceph osd pool set ssdpool pg_num 512
ceph osd pool set ssdpool pg_num_min 512
ceph osd pool set ssdpool pg_num_max 512

Optional testing [create a block device and mount it locally]

rbd create --size 10G --pool nvmepool nvmerbd
rbd map nvmerbd --pool nvmepool
# (Change the device name if a different one gets created)
mkfs.ext4 /dev/rbd0
mkdir /root/test
mount /dev/rbd0 /root/test
Enable ceph-mgr-dashboard; create an administrator user account to access the dashboard
Create a text file with the administrator password to be used (in this case I had created dbpass.txt)
ceph mgr module enable dashboard
ceph config set mgr mgr/dashboard/ssl false
ceph dashboard ac-user-create admin -i dbpass.txt administrator
Now access http://ceph1:8080
TODO: CPU Pinning for Ceph-OSD
In all servers
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
  NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
Update /etc/systemd/system/ceph-osd@.service.d/osd.conf with the following
[Service]
# cgroup limits
MemoryMax=10240M
#CPUAccounting=true
CPUQuota=400%

# Clear the upstream ExecStart, then launch under numactl with the hugepage preload
ExecStart=
ExecStart=/usr/bin/numactl --membind=0 --physcpubind=0,2,4,6 env LD_PRELOAD=/lib/x86_64-linux-gnu/libhugetlbfs.so.0 HUGETLB_MORECORE=yes /usr/bin/ceph-osd --id %i --foreground
Explanation
- --membind=0 : Allocates memory (hugepages) only from NUMA node 0
- --physcpubind=0,2,4,6 : Pins the OSD to physical CPUs 0, 2, 4 and 6
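On servers with a different CPU layout, the --physcpubind list can be derived from the node's CPU list rather than hard-coded. A minimal sketch (the helper name is mine; it just takes the first N entries of a comma-separated list such as the "NUMA node0 CPU(s)" line above):

```shell
# Hypothetical helper: pick the first N CPUs from a comma-separated list,
# e.g. the "NUMA node0 CPU(s)" output of lscpu.
first_cpus() {
  local list=$1 n=$2
  echo "$list" | tr ',' '\n' | head -n "$n" | paste -sd, -
}

node0="0,2,4,6,8,10,12,14"
first_cpus "$node0" 4    # -> 0,2,4,6
```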