The plan is to use 10.0.4.0/24 for the public network and 10.0.5.0/24 for the cluster network.
Edit and update /etc/hosts on all servers by adding the following entries:
10.0.4.1 storage1
10.0.4.2 storage2
10.0.4.3 storage3
10.0.4.4 storage4
10.0.5.1 ceph1
10.0.5.2 ceph2
10.0.5.3 ceph3
10.0.5.4 ceph4
Enable key-based access between nodes (root user)
Create a /root/.ssh/config file with the following content on all four servers.
Host storage1
    Hostname storage1
    User root
Host storage2
    Hostname storage2
    User root
Host storage3
    Hostname storage3
    User root
Host storage4
    Hostname storage4
    User root
Set the correct permissions for the file
chmod 600 /root/.ssh/config
Generate an SSH key pair on all servers
ssh-keygen -q -N ""
Remove any stale host keys and transfer the public key from each node to the other nodes
ssh-keygen -f '/root/.ssh/known_hosts' -R 'storage1'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'storage2'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'storage3'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'storage4'
ssh-copy-id storage1
ssh-copy-id storage2
ssh-copy-id storage3
ssh-copy-id storage4
Install Ceph on all servers
apt -y install ceph python3-packaging
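As an optional sanity check after the install, the packaged release can be confirmed on each node:

ceph --version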
Generate a unique UUID to use as the cluster fsid.
uuidgen
b115cfad-cce9-4404-a9eb-e821e856bbfd
Start building the cluster with one monitor node. Create the initial Ceph configuration file /etc/ceph/ceph.conf
root@server3:~# cat /etc/ceph/ceph.conf
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1
mon initial members = storage1
[mon.storage1]
host = storage1
mon addr = 10.0.4.1
mon allow pool delete = true
[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
We have enabled huge pages support with a page size of 2 MB on these servers, so adding “bluestore_rocksdb_options = memtable_huge_page_size=2048” lets the OSDs’ RocksDB memtables use them, improving performance.
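For reference, a minimal sketch of how 2 MB huge pages can be reserved on the host side; the page count of 1024 is purely illustrative and not a value taken from this setup.

# Reserve 1024 x 2 MB huge pages at runtime (illustrative count)
sysctl -w vm.nr_hugepages=1024
# Persist the setting across reboots
echo "vm.nr_hugepages=1024" >> /etc/sysctl.d/90-hugepages.conf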
Generate a secret key for the monitor.
ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
Generate a secret key for the admin client.
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
Generate a key for bootstrapping OSDs
ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
Import the generated keys into the monitor keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
Generate the monitor map (initially with only one node)
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
NODENAME=$(grep "^mon initial" /etc/ceph/ceph.conf | awk '{print $NF}')
NODEIP=$(grep "^mon host" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --create --add $NODENAME $NODEIP --fsid $FSID /etc/ceph/monmap
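Optionally, the generated map can be inspected before it is used; monmaptool --print dumps the fsid and the monitor entries it contains.

monmaptool --print /etc/ceph/monmap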
Create a directory for the monitor daemon [ clustername-nodename ] and initialize the monitor store
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
Ensure the ceph user owns all files and then enable the monitor daemon.
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
Enable Messenger V2 protocol and placement group autoscaler.
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph mgr module enable pg_autoscaler
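Once pools exist, the autoscaler's recommendations can be reviewed with the command below; at this stage, with no pools created yet, it returns an empty listing.

ceph osd pool autoscale-status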
Copy the configuration and keyring files to all other servers
Repeat these steps for each of the other servers, setting the value of SERVER appropriately
export SERVER=server2
ssh-keygen -f '/root/.ssh/known_hosts' -R ${SERVER}
scp /etc/ceph/ceph.conf ${SERVER}:/etc/ceph/ceph.conf
scp /etc/ceph/ceph.client.admin.keyring ${SERVER}:/etc/ceph
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ${SERVER}:/var/lib/ceph/bootstrap-osd
ssh ${SERVER} "chown ceph:ceph /etc/ceph/ceph.* /var/lib/ceph/bootstrap-osd/*"
Configuring manager nodes – repeat the following steps with NODENAME changed to storage3 and storage4
SSH into the server (the following steps are for server1)
Create a directory for the manager daemon [ clustername-nodename ] and generate the auth key
NODENAME=storage1
mkdir /var/lib/ceph/mgr/ceph-$NODENAME
ceph auth get-or-create mgr.$NODENAME mon 'allow profile mgr' osd 'allow *' mds 'allow *'
Create the manager keyring file, copy it into place, set ownership and enable the manager daemon (initially on one node only)
ceph auth get-or-create mgr.$NODENAME | tee /etc/ceph/ceph.mgr.admin.keyring
cp /etc/ceph/ceph.mgr.admin.keyring /var/lib/ceph/mgr/ceph-$NODENAME/keyring
chown ceph:ceph /etc/ceph/ceph.mgr.admin.keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-$NODENAME
systemctl enable --now ceph-mgr@$NODENAME
A basic Ceph cluster with one monitor, three managers and no OSDs is now up and running.
root@server1:~# ceph -s
  cluster:
    id:     b115cfad-cce9-4404-a9eb-e821e856bbfd
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3
  services:
    mon: 1 daemons, quorum storage1 (age 15m)
    mgr: storage4(active, since 68s), standbys: storage1, storage3
    osd: 0 osds: 0 up, 0 in
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
Before creating Ceph volumes (and enabling the OSD services), it is recommended to ensure that no LV, VG or PV remains on the target storage device.
Simple steps: use the lv/vg/pv display commands, and if any are listed, use the lv/vg/pv remove commands in order – LV first, VG second and PV third (a concrete example follows the list below).
lvdisplay
lvremove <LV Path>
vgdisplay
vgremove <VG Name>
pvdisplay
pvremove <PV Name>
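As an illustration only (the device, VG and LV names below are hypothetical and not from this cluster), a full cleanup of a disk that previously carried an LVM layout might look like this:

lvdisplay                              # shows e.g. LV Path /dev/ceph-old/osd-block-old
lvremove /dev/ceph-old/osd-block-old   # remove the LV first
vgdisplay                              # shows e.g. VG Name ceph-old
vgremove ceph-old                      # then the VG
pvdisplay                              # shows e.g. PV Name /dev/sdb
pvremove /dev/sdb                      # and finally the PV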
Create Ceph volumes on all servers
Note: I include wipefs because I went through multiple installation iterations and found it necessary.
Note: Using /dev/disk/by-partuuid is recommended when creating the ceph-volume, so I used blkid to get the part-uuid after creating the primary partition.
Note: Set the value for the crush-device-class name based on your needs for the logical grouping of OSDs. I plan to use nvme and ssd.
parted /dev/nvme0n1 mklabel gpt
parted --script /dev/nvme0n1 "mkpart primary 0% 100%"
wipefs -a /dev/nvme0n1p1
blkid
ceph-volume lvm create --data /dev/disk/by-partuuid/34dceb3b-67e0-4cf8-8bac-3a15105a50c5 --crush-device-class nvme
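The SSD-backed OSDs follow the same sequence with the device class set to ssd; a sketch assuming the SSD is /dev/sda (an assumption, not from this setup) and using a placeholder part-uuid taken from the blkid output:

parted /dev/sda mklabel gpt
parted --script /dev/sda "mkpart primary 0% 100%"
wipefs -a /dev/sda1
blkid
ceph-volume lvm create --data /dev/disk/by-partuuid/<part-uuid-from-blkid> --crush-device-class ssd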
After creating all Ceph volumes, check the status
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 6m)
    mgr: ceph1(active, since 27m), standbys: ceph2, ceph4, ceph3
    osd: 18 osds: 18 up (since 6m), 18 in (since 6m)
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 705 KiB
    usage:   1.4 GiB used, 20 TiB / 20 TiB avail
    pgs:     1 active+clean
root@server1:~#
Configure monitor services on storage2 and storage4 to have three monitor nodes.
Log in to server1 and update the monitor map to include storage2 and storage4 as monitor nodes.
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --add storage2 10.0.4.2 --fsid $FSID /etc/ceph/monmap
monmaptool --add storage4 10.0.4.4 --fsid $FSID /etc/ceph/monmap
Update the ceph configuration file to reflect the monitor and manager nodes.
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1,10.0.4.2,10.0.4.4
mon initial members = storage1,storage2,storage4
mgr host = 10.0.4.1,10.0.4.2,10.0.4.3,10.0.4.4
mgr initial members = storage1,storage2,storage3,storage4
[mon.storage1]
host = storage1
mon addr = 10.0.4.1
mon allow pool delete = true
[mon.storage2]
host = storage2
mon addr = 10.0.4.2
mon allow pool delete = true
[mon.storage4]
host = storage4
mon addr = 10.0.4.4
mon allow pool delete = true
[mgr.storage1]
host = storage1
mon addr = 10.0.4.1
[mgr.storage2]
host = storage2
mon addr = 10.0.4.2
[mgr.storage3]
host = storage3
mon addr = 10.0.4.3
[mgr.storage4]
host = storage4
mon addr = 10.0.4.4
[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
Log in to the new monitor nodes and configure the monitor daemon (steps to be repeated on storage2 and storage4)
Copy the required config files from the existing monitor node (server1)
SOURCE="storage1" SRCIP="10.0.4.1" NEWNODE="storage2" scp ${SRCIP}:/etc/ceph/ceph.conf /etc/ceph/ceph.conf scp ${SRCIP}:/etc/ceph/ceph.mon.keyring /etc/ceph scp ${SRCIP}:/etc/ceph/monmap /etc/ceph
Create a folder for persistent use by the monitor daemon, set ownership of the files to the ‘ceph’ user account and start the monitor daemon.
rm -rf /var/lib/ceph/mon/ceph-${NEWNODE}
mkdir -p /var/lib/ceph/mon/ceph-${NEWNODE}
chown -R ceph:ceph /etc/ceph /var/lib/ceph/mon
ceph-mon --cluster ceph --mkfs -i $NEWNODE --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown -R ceph:ceph /etc/ceph /var/lib/ceph/mon
systemctl enable --now ceph-mon@$NEWNODE
ceph mon enable-msgr2
At this stage, the OSD daemons and three monitor daemons are up:
root@server1:~# ceph -s
cluster:
id: 577c09c2-c514-471a-aee1-6a0f56c83c3a
health: HEALTH_OK
services:
mon: 3 daemons, quorum storage1,storage4,storage2 (age 45m)
mgr: storage4(active, since 43m), standbys: storage1, storage3
osd: 8 osds: 8 up (since 23m), 8 in (since 3h)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 705 KiB
usage: 679 MiB used, 7.3 TiB / 7.3 TiB avail
pgs: 1 active+clean
root@server1:~#
Setting CPU and RAM limits for OSD daemons (Repeat in all servers)
How much to reserve is not known up front; based on some tests, however, I arrived at four vCPUs and 8 GB of RAM.
systemctl edit ceph-osd@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=8192M
CPUQuota=400%
### Edits below this comment will be discarded
Reload unit files and restart OSD services
systemctl daemon-reload
systemctl restart ceph-osd.target
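To confirm the limits were applied, the relevant properties of a running OSD unit can be inspected; osd.0 is used here as an example, so substitute an OSD id that actually exists on the node.

systemctl show ceph-osd@0 | grep -E 'MemoryMax|CPUQuota'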
Setting CPU and RAM limits for Monitor daemons
How much to reserve is not known, so I decided to reserve 2 vCPUs and 4 GB of RAM.
systemctl edit ceph-mon@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=4096M
CPUQuota=200%
### Edits below this comment will be discarded
Reload unit files and restart monitor services
systemctl daemon-reload
systemctl restart ceph-mon.target
Setting CPU and RAM limits for Manager daemons
How much to reserve is not known, so I decided to reserve 1 vCPU and 2 GB of RAM.
systemctl edit ceph-mgr@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=2048M
CPUQuota=100%
### Edits below this comment will be discarded
Reload unit files and restart manager services
systemctl daemon-reload
systemctl restart ceph-mgr.target
Create crush rules for logically grouping NVME and SSD OSDs
root@server1:~# ceph osd crush rule create-replicated nvme_rule default host nvme
root@server1:~# ceph osd crush rule create-replicated ssd_rule default host ssd
root@server1:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "nvme_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -15, "item_name": "default~nvme" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "ssd_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -2, "item_name": "default~ssd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]
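The device classes assigned during ceph-volume creation can be checked to confirm the rules will match the intended OSDs; ceph osd crush class ls lists the classes and ceph osd tree shows the per-OSD assignment.

ceph osd crush class ls
ceph osd tree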
Create an RBD pool and initialize
Enough care must be taken to identify the PG count when a pool is created. Though it is possible to alter it after creation, doing so impacts the cluster’s performance, as rebalancing occurs when the PG count changes. An online PG calculator (or the rule of thumb below) can be used to size it.
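As a rough, commonly cited rule of thumb (an assumption here, not a figure measured on this cluster), the PG count is approximately (number of OSDs in the device class × target PGs per OSD) ÷ replica count, rounded up to the next power of two. A sketch with illustrative numbers:

# Hypothetical sizing: 10 OSDs in the class, 100 target PGs per OSD, replication 2
echo $(( 10 * 100 / 2 ))     # 500 -> round up to the next power of two: 512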
Create two pools, one with nvme_rule and one with ssd_rule, both with a replication factor of two.
ceph osd pool create ssdpool 512 512 replicated ssd_rule --size=2
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable ssdpool rbd
rbd pool init ssdpool
ceph osd pool create nvmepool 1024 1024 replicated nvme_rule --size=2
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable nvmepool rbd
rbd pool init nvmepool
Create a block device and mount it locally.
rbd create --size 10G --pool nvmepool nvmebd
rbd map nvmebd --pool nvmepool
mkfs.ext4 /dev/rbd0     # change the device name if a different one gets created
mkdir /root/test
mount /dev/rbd0 /root/test
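As an optional final check (not part of the original steps), the mapped RBD devices and the mounted filesystem can be verified:

rbd showmapped       # lists mapped RBD images and their /dev/rbdX devices
df -h /root/test     # confirms the filesystem is mounted with the expected size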