The plan is to use 10.0.4.0/24 for the public network and 10.0.5.0/24 for the cluster network.
As part of planning and preparing the servers, /etc/hosts was updated on all servers:
10.0.4.1 ceph1
10.0.4.2 ceph2
10.0.4.3 ceph3
10.0.4.4 ceph4
10.0.5.1 ceph-private1
10.0.5.2 ceph-private2
10.0.5.3 ceph-private3
10.0.5.4 ceph-private4
As part of planning and preparing the servers, we also enabled passwordless, key-based SSH access between the servers, which is a prerequisite for this Ceph installation.
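For reference, a minimal sketch of how that key-based access can be set up (assuming the root user and the host names defined in /etc/hosts above; adjust to your environment):
# Generate a key pair if the node does not already have one
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
# Copy the public key to every node (repeat from each server)
for host in ceph1 ceph2 ceph3 ceph4; do ssh-copy-id root@$host; done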
Add ceph repository
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-add-repository 'deb https://download.ceph.com/debian-reef/ bookworm main'
apt -y update
apt -y upgrade
Install Ceph on all servers and remove cephadm (not using it)
apt -y install ceph python3-packaging
apt remove --purge cephadm
apt reinstall python3-cryptography
apt reinstall ceph-mgr
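Optionally confirm that every node ended up on the same Reef release:
ceph --version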
Login (ssh) into node ceph1 (first node)
Generate a unique UUID to use as the cluster fsid.
uuidgen
b115cfad-cce9-4404-a9eb-e821e856bbfd
Create ceph configuration file /etc/ceph/ceph.conf (with only one monitor node to start with)
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1
mon initial members = ceph1
[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072
[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1
[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
osd_memory_target = 14G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 6
Notes
We have enabled 'huge pages' support with a 2M page size on these servers (see the sketch after these notes). Adding "bluestore_rocksdb_options = memtable_huge_page_size=2048" lets the OSDs' RocksDB memtables use these huge pages, improving performance.
We plan to reserve 16G per OSD in the systemd configuration (shown later), which should be sufficient to cover the 14G osd_memory_target.
We want an OSD that stays down for 3 minutes (180 seconds) to be marked out so that its PGs are rebalanced onto the remaining OSDs.
When 25% of the OSDs are down, the cluster should go into a read-only state to ensure data consistency.
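For reference, a minimal sketch of how 2M huge pages are typically reserved on Debian during server preparation (the page count below is an illustrative value, not the one used in our setup):
# Reserve 2 MiB huge pages at boot (example count)
echo "vm.nr_hugepages = 8192" > /etc/sysctl.d/90-hugepages.conf
sysctl --system
# Confirm the reservation
grep Huge /proc/meminfo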
Create a keyring for the cluster and generate a monitor secret key.
ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
Generate an administrator keyring, generate a client.admin user, and add the user to the keyring
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
Generate a bootstrap-osd keyring, generate a client.bootstrap-osd user, and add the user to the keyring
ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
Import generated keys
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
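Optionally list the combined keyring to confirm that the mon., client.admin and client.bootstrap-osd keys were all imported:
ceph-authtool /etc/ceph/ceph.mon.keyring --list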
Generate monitor map
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk {'print $NF'})
monmaptool --create --add ceph1 10.0.4.1 --fsid $FSID /etc/ceph/monmap
Copy the generated map and configuration files to other nodes
scp /etc/ceph/* ceph2:/etc/ceph/
scp /etc/ceph/* ceph3:/etc/ceph/
scp /etc/ceph/* ceph4:/etc/ceph/
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph2:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph3:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph4:/var/lib/ceph/bootstrap-osd
ssh ceph2 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph3 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph4 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
Configure and enable monitor daemon
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph1
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
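A quick sanity check that the first monitor is up and serving both protocol versions (the dump should show a v2 address alongside the v1 address for ceph1):
ceph mon stat
ceph mon dump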
Create ceph volumes in all servers
We used the following snippet to create Ceph volumes on the NVMe disks (adjust the devs list to match the NVMe device names on each server)
export devs="nvme0 nvme1 nvme2" for DEVNAME in $devs; do if lsblk | grep "${DEVNAME}n1p1"; then echo "Clearing /dev/${DEVNAME}n1" wipefs -a /dev/${DEVNAME}n1 else echo "No partitions found on /dev/${DEVNAME}n1" fi parted /dev/${DEVNAME}n1 mklabel gpt parted -s /dev/${DEVNAME}n1 "mkpart primary 0% 100%" wipefs -a /dev/${DEVNAME}n1p1 ceph-volume lvm create --data /dev/${DEVNAME}n1p1 --crush-device-class nvme done
We used the following snippet to create Ceph volumes on the SSD disks (again, adjust the devs list per server)
export devs="sda sde" for DEVNAME in $devs; do if lsblk | grep "${DEVNAME}1"; then echo "Clearing /dev/${DEVNAME}" wipefs -a /dev/${DEVNAME} else echo "No partitions found on /dev/${DEVNAME}" fi parted /dev/${DEVNAME} mklabel gpt parted -s /dev/${DEVNAME} "mkpart primary 0% 100%" wipefs -a /dev/${DEVNAME}1 ceph-volume lvm create --data /dev/${DEVNAME}1 --crush-device-class ssd done
After creating all ceph volumes, check the status
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 1 daemons, quorum ceph1 (age 35m)
    mgr: no daemons active
    osd: 18 osds: 18 up (since 22s), 18 in (since 34s)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
root@server1:~# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME         STATUS  REWEIGHT  PRI-AFF
 -1         19.79449  root default
 -3          5.38527      host server1
  0   nvme   1.81929          osd.0         up   1.00000  1.00000
  1   nvme   0.90970          osd.1         up   1.00000  1.00000
  2   nvme   0.90970          osd.2         up   1.00000  1.00000
  3   ssd    0.87329          osd.3         up   1.00000  1.00000
  4   ssd    0.87329          osd.4         up   1.00000  1.00000
 -7          5.38527      host server2
  5   nvme   1.81929          osd.5         up   1.00000  1.00000
  6   nvme   0.90970          osd.6         up   1.00000  1.00000
  7   nvme   0.90970          osd.7         up   1.00000  1.00000
  8   ssd    0.87329          osd.8         up   1.00000  1.00000
  9   ssd    0.87329          osd.9         up   1.00000  1.00000
-10          3.63869      host server3
 10   nvme   1.81929          osd.10        up   1.00000  1.00000
 11   nvme   0.90970          osd.11        up   1.00000  1.00000
 12   nvme   0.90970          osd.12        up   1.00000  1.00000
-13          5.38527      host server4
 13   nvme   1.81929          osd.13        up   1.00000  1.00000
 14   nvme   0.90970          osd.14        up   1.00000  1.00000
 15   nvme   0.90970          osd.15        up   1.00000  1.00000
 16   ssd    0.87329          osd.16        up   1.00000  1.00000
 17   ssd    0.87329          osd.17        up   1.00000  1.00000
root@server1:~#
Configure monitor services in ceph2 and ceph3 to have three monitor nodes.
Login to ceph1 and update the monitor map to include ceph2 and ceph3 as monitor nodes.
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk {'print $NF'})
monmaptool --add ceph2 10.0.4.2 --fsid $FSID /etc/ceph/monmap
monmaptool --add ceph3 10.0.4.3 --fsid $FSID /etc/ceph/monmap
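The updated map can be inspected before it is distributed; it should now list ceph1, ceph2 and ceph3:
monmaptool --print /etc/ceph/monmap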
Update the ceph configuration file to reflect the new monitor nodes.
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1,10.0.4.2,10.0.4.3
mon initial members = ceph1,ceph2,ceph3
[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072
[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1
[mon.ceph2]
host = ceph2
mon addr = 10.0.4.2
[mon.ceph3]
host = ceph3
mon addr = 10.0.4.3
[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
osd_memory_target = 14G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 6
Copy the generated map and configuration files to other nodes
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph2:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph3:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph4:/etc/ceph/
ssh ceph2 "chown ceph:ceph -R /etc/ceph"
ssh ceph3 "chown ceph:ceph -R /etc/ceph"
ssh ceph4 "chown ceph:ceph -R /etc/ceph"
On each of the new monitor nodes (ceph2 and ceph3)
- Login (ssh) into the node
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph2   # set to ceph3 on the third monitor node
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
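Once both new monitors have been started, confirm that all three are in quorum (the output should list ceph1, ceph2 and ceph3):
ceph mon stat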
On each of the manager nodes (ceph1, ceph2, ceph3, ceph4)
- Login (ssh) into the node
- Create a default data directory on the manager host
- Create an authentication key for the manager daemon
- Enable the daemon to start on host startup
NODENAME=ceph1   # set to the hostname of the node being configured (ceph1..ceph4)
mkdir /var/lib/ceph/mgr/ceph-$NODENAME
ceph auth get-or-create mgr.$NODENAME mon 'allow profile mgr' osd 'allow *' mds 'allow *'
ceph auth get-or-create mgr.$NODENAME | tee /etc/ceph/ceph.mgr.admin.keyring
cp /etc/ceph/ceph.mgr.admin.keyring /var/lib/ceph/mgr/ceph-$NODENAME/keyring
chown ceph:ceph /etc/ceph/ceph.mgr.admin.keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-$NODENAME
systemctl enable --now ceph-mgr@$NODENAME
At this stage, all OSD daemons are up, three monitor daemons are in quorum, and the manager daemons are running (one active, three standby).
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 5m)
    mgr: ceph1(active, since 4m), standbys: ceph2, ceph3, ceph4
    osd: 18 osds: 18 up (since 31m), 18 in (since 31m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 961 KiB
    usage:   1.4 GiB used, 20 TiB / 20 TiB avail
    pgs:     1 active+clean

root@server1:~#
Setting CPU and RAM limits for OSD daemons (Repeat in all servers)
How much to reserve is hard to know up front, so we start with values based on availability/affordability and adjust later. Reserve the resources as follows.
systemctl edit ceph-osd@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=16384M
CPUQuota=600%
### Edits below this comment will be discarded
Setting CPU and RAM limits for Monitor daemons
systemctl edit ceph-mon@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=8192M
CPUQuota=400%
### Edits below this comment will be discarded
Setting CPU and RAM limits for Manager daemons
systemctl edit ceph-mgr@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=2048M
CPUQuota=200%
### Edits below this comment will be discarded
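The drop-ins only take effect once the daemons are restarted; restart them one node at a time so the cluster keeps quorum, then check the resulting properties (osd.0 here is just an example id, use one that lives on the node):
systemctl restart ceph-osd.target ceph-mon.target ceph-mgr.target
systemctl show ceph-osd@0 --property=MemoryMax,CPUQuotaPerSecUSec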
Create crush rules for logically grouping NVME and SSD OSDs
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd crush rule create-replicated ssd_rule default host ssd
Configure .mgr pool to use NVME storage
ceph osd pool set .mgr crush_rule nvme_rule
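Verify that the rules exist and that the .mgr pool picked up the new rule:
ceph osd crush rule ls
ceph osd pool get .mgr crush_rule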
Create an RBD pool and initialize
Create two pools, one using nvme_rule and one using ssd_rule, both with a replication factor of three (the default even if not configured explicitly)
ceph osd pool create ssdpool 512 512 replicated ssd_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable ssdpool rbd
rbd pool init ssdpool
ceph osd pool create nvmepool 1024 1024 replicated nvme_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable nvmepool rbd
rbd pool init nvmepool
Create a block device and mount it locally.
rbd create --size 10G --pool nvmepool nvmebd
rbd map nvmebd --pool nvmepool
mkfs.ext4 /dev/rbd0   # change the device name if a different one gets created
mkdir /root/test
mount /dev/rbd0 /root/test
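rbd showmapped lists the current mappings; once the test is done, the device can be unmounted and unmapped again:
rbd showmapped
umount /root/test
rbd unmap /dev/rbd0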
Enable ceph-mgr-dashboard; create an administrator user account to access the dashboard
Create a text file with the administrator password to be used (in this case I had created dbpass.txt)
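On Debian the dashboard module ships in the separate ceph-mgr-dashboard package, so install it first if it is not already present. The password file simply holds the plain-text password; a minimal example (the value below is a placeholder, choose your own):
apt -y install ceph-mgr-dashboard
echo -n 'MyAdminPassw0rd' > dbpass.txt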
ceph mgr module enable dashboard
ceph config set mgr mgr/dashboard/ssl false
ceph dashboard ac-user-create admin -i dbpass.txt administrator
Now access http://ceph1:8080
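If the page does not load, ceph mgr services shows the URL the active manager is serving the dashboard on (the dashboard follows the active mgr, so it can move to another node after a failover):
ceph mgr services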