The plan is to use 10.0.4.0/24 for the public network and 10.0.5.0/24 for the cluster network.
DNS entries were updated as follows:

10.0.4.1 ceph1
10.0.4.2 ceph2
10.0.4.3 ceph3
10.0.4.4 ceph4
10.0.5.1 csync1
10.0.5.2 csync2
10.0.5.3 csync3
10.0.5.4 csync4
As part of planning and preparing the servers, we enabled passwordless, key-based SSH access between them, a prerequisite for Ceph installation.
Add the Ceph repository

wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-add-repository 'deb https://download.ceph.com/debian-squid/ bookworm main'
apt -y update
apt -y upgrade
Install Ceph on all servers, remove cephadm (we are not using it), and install numactl for CPU pinning of the OSD services.
apt -y install ceph python3-packaging numactl libhugetlbfs-bin libhugetlbfs0
apt remove --purge cephadm
apt reinstall python3-cryptography
apt reinstall ceph-mgr
Log in (SSH) to node ceph1 (the first node)
Generate a unique UUID for the cluster fsid.

uuidgen
b115cfad-cce9-4404-a9eb-e821e856bbfd
Create a ceph configuration file /etc/ceph/ceph.conf (with only one monitor node to start with)
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1
mon initial members = ceph1

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 1024

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=1048576
osd_memory_target = 8G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 12
bluestore_cache_size = 4294967296 # 4 GiB
bluestore_cache_max_dirty = 2147483648
bluestore_cache_deferred_read = true
Notes
We have enabled huge pages support. Adding "bluestore_rocksdb_options = memtable_huge_page_size=1048576" lets the OSD's RocksDB memtables use them, improving performance.
We have planned to reserve 10 G per OSD (via systemd configuration), which should be sufficient to cover the 8 G OSD memory target.
When 25% of the OSDs are down, go into read-only mode to ensure data consistency (osd_min_in_ratio = 0.75).
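The second note can be sanity-checked with a line of arithmetic: the per-OSD systemd reservation (10 G, set later via MemoryMax) leaves headroom above the 8 G osd_memory_target, which Ceph treats as a best-effort target rather than a hard cap.

```shell
# Headroom between the per-OSD systemd reservation and osd_memory_target.
# Values mirror this document's configuration (10 G reservation, 8 G target).
osd_memory_target_gb=8
systemd_memorymax_gb=10
headroom_gb=$(( systemd_memorymax_gb - osd_memory_target_gb ))
echo "headroom: ${headroom_gb}G"   # prints "headroom: 2G"
```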
Create a keyring for the cluster and generate a monitor secret key.
ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
Generate an administrator keyring, generate a client.admin user, and add the user to the keyring
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
Generate a bootstrap-osd keyring, generate a client.bootstrap-osd user, and add the user to the keyring
ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
Import generated keys
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
Generate a monitor map
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --create --add ceph1 10.0.4.1 --fsid $FSID /etc/ceph/monmap

Copy the generated map and configuration files to the other nodes
scp /etc/ceph/* ceph2:/etc/ceph/
scp /etc/ceph/* ceph3:/etc/ceph/
scp /etc/ceph/* ceph4:/etc/ceph/
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph2:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph3:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph4:/var/lib/ceph/bootstrap-osd
ssh ceph2 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph3 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph4 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
Configure and enable the monitor daemon
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph1
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
Create ceph volumes on all servers
Note: Carving out separate partitions for block.wal and block.db (away from the data block) reduces the latency of write operations.
There are no specific references for calculating the size of the WAL and DB partitions; on a simple ad-hoc basis we used 8 G WAL / 32 G block DB per 1 TB of raw storage.
Used the following snippet to create Ceph volumes on NVMe disks (replace 'x' with the NVMe number).
Note: passing --osd-id during ceph-volume lvm create does not help - TODO: update the script to handle this.
The sleep calls are not strictly required, but they are included to avoid any timing issues.
#!/bin/bash
export devs="nvme0,nvme1,nvme2"
export osds="0,1,2"

# Split into arrays
IFS=',' read -ra dev_arr <<< "$devs"
IFS=',' read -ra osd_arr <<< "$osds"

for counter in "${!dev_arr[@]}"; do
  OSD="${osd_arr[$counter]}"
  DEVNAME="${dev_arr[$counter]}"
  SERVICE_NAME="ceph-osd@${OSD}.service"
  echo "============================================================================"
  echo "Device : $DEVNAME. Stopping ${SERVICE_NAME}"
  systemctl stop "${SERVICE_NAME}"
  sleep 2

  # Remove OSD
  echo " "
  echo "Executing : ceph osd down $OSD"
  ceph osd down ${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd out $OSD"
  ceph osd out ${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd crush remove osd.$OSD"
  ceph osd crush remove osd.${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph auth del osd.$OSD"
  ceph auth del osd.${OSD}
  sleep 2
  echo " "
  echo "Executing : ceph osd rm $OSD"
  ceph osd rm ${OSD}
  sleep 2

  # Identify WAL size / Block DB size - 8G / 32G per TB respectively
  sectors=$(cat /sys/block/${DEVNAME}n1/size)
  bytes=$(( sectors * 512 ))
  tb_bytes=$(( 1000 * 1000 * 1000 * 1000 ))
  tb_ceil=$(( bytes / tb_bytes ))
  # In case of smaller storage media keep min calculations against 1 TB
  if [[ $tb_ceil -lt 1 ]]; then
    tb_ceil=1
  fi
  wal_gb=$(( tb_ceil * 8 ))
  data_gb=$(( tb_ceil * 32 ))
  echo "Device : $DEVNAME, OSD : $OSD, Bytes : $bytes, WAL : $wal_gb, BLOCKDATA : $data_gb"

  # Zap ceph volumes if present
  # Assuming only three partitions were created earlier
  for (( i=1; i <= 3; i++ )); do
    partition="/dev/${DEVNAME}n1p${i}"
    if [ -b "$partition" ]; then
      sleep 3
      echo " "
      echo "Executing ceph-volume lvm zap $partition"
      ceph-volume lvm zap $partition
      echo " "
      echo "Executing wipefs --all --force --quiet ${partition}"
      wipefs --all --force --quiet ${partition}
      sleep 2
    else
      echo "Partition $partition does not exist."
    fi
  done

  # ZAP reports device in use - work around
  lsblk /dev/${DEVNAME}n1 | grep lvm | sed 's/^....//' | cut -d " " -f1 | grep "^ceph" > dmsetup.txt
  while IFS= read -r line; do
    echo "Executing dmsetup remove $line"
    dmsetup remove "$line"
  done < dmsetup.txt
  rm -f dmsetup.txt
  sleep 2

  # Extra steps to clear FS
  echo " "
  echo "Executing dd if=/dev/zero of=/dev/${DEVNAME}n1 bs=1M count=10"
  dd if=/dev/zero of=/dev/${DEVNAME}n1 bs=1M count=10
  sleep 3
  echo " "
  echo "Executing partprobe /dev/${DEVNAME}n1"
  partprobe /dev/${DEVNAME}n1

  # Create partitions for ceph
  sleep 3
  echo " "
  echo "Creating partitions - Device ${DEVNAME}n1"
  parted /dev/${DEVNAME}n1 --script \
    mklabel gpt \
    unit GB \
    mkpart primary 1MiB "${wal_gb}GB" \
    mkpart primary "${wal_gb}GB" "$((wal_gb + data_gb))GB" \
    mkpart primary "$((wal_gb + data_gb))GB" 100%

  echo " "
  echo "Creating ceph volumes..."
  # Create the ceph volume
  sleep 3
  echo " "
  echo "Executing ceph-volume lvm --osd-id ${OSD} create --bluestore --data /dev/${DEVNAME}n1p3 --block.db /dev/${DEVNAME}n1p2 --block.wal /dev/${DEVNAME}n1p1 --crush-device-class nvme"
  ceph-volume lvm --osd-id ${OSD} create --bluestore --data /dev/${DEVNAME}n1p3 --block.db /dev/${DEVNAME}n1p2 --block.wal /dev/${DEVNAME}n1p1 --crush-device-class nvme

  # Start OSD service - may not be required
  sleep 3
  echo " "
  echo "Executing systemctl start ceph-osd@$OSD.service"
  systemctl start ceph-osd@${OSD}.service
  sleep 3
  echo " "
  echo "Executing ceph osd in $OSD"
  ceph osd in ${OSD}
done
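The WAL/DB sizing arithmetic embedded in the script above can be pulled out into a standalone helper for review. A minimal sketch (the function name is mine; it performs the same integer math as the script: floor to whole TB, clamp small devices to 1 TB, then 8 G WAL and 32 G DB per TB):

```shell
# Hypothetical helper replicating the script's ad-hoc sizing rule:
# 8 G WAL and 32 G block DB per TB of raw capacity (minimum 1 TB).
wal_db_sizes() {
  local bytes=$1
  local tb=$(( bytes / (1000 * 1000 * 1000 * 1000) ))
  (( tb < 1 )) && tb=1                  # clamp small devices to 1 TB
  echo "$(( tb * 8 )) $(( tb * 32 ))"   # "<wal_gb> <db_gb>"
}

wal_db_sizes 1920383410176   # 1.92 TB NVMe -> "8 32"
wal_db_sizes 3840755982336   # 3.84 TB NVMe -> "24 96"
```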
Used the same snippet, with the following changes, to create Ceph volumes on SSD disks

#!/bin/bash
export devs="cephdisk3,cephdisk4,cephdisk5,cephdisk6"
export osds="3,4,5,6"
After creating all ceph volumes, check the status
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 1 daemons, quorum ceph1 (age 35m)
    mgr: no daemons active
    osd: 18 osds: 18 up (since 22s), 18 in (since 34s)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
root@server1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 27.75752 root default
-3 6.46405 host server1
0 nvme 1.80479 osd.0 up 1.00000 1.00000
1 nvme 0.90239 osd.1 up 1.00000 1.00000
2 nvme 0.90239 osd.2 up 1.00000 1.00000
3 ssd 0.86600 osd.3 up 1.00000 1.00000
4 ssd 0.86600 osd.4 up 1.00000 1.00000
5 ssd 0.90239 osd.5 up 1.00000 1.00000
6 ssd 0.22009 osd.6 up 1.00000 1.00000
-7 7.07356 host server2
7 nvme 1.80479 osd.7 up 1.00000 1.00000
8 nvme 0.90239 osd.8 up 1.00000 1.00000
9 nvme 0.90239 osd.9 up 1.00000 1.00000
10 ssd 0.86600 osd.10 up 1.00000 1.00000
11 ssd 0.86600 osd.11 up 1.00000 1.00000
12 ssd 0.86600 osd.12 up 1.00000 1.00000
13 ssd 0.86600 osd.13 up 1.00000 1.00000
-10 7.14635 host server3
14 nvme 1.80479 osd.14 up 1.00000 1.00000
15 nvme 0.90239 osd.15 up 1.00000 1.00000
16 nvme 0.90239 osd.16 up 1.00000 1.00000
17 ssd 0.86600 osd.17 up 1.00000 1.00000
18 ssd 0.90239 osd.18 up 1.00000 1.00000
19 ssd 0.90239 osd.19 up 1.00000 1.00000
20 ssd 0.86600 osd.20 up 1.00000 1.00000
-13 7.07356 host server4
21 nvme 1.80479 osd.21 up 1.00000 1.00000
22 nvme 0.90239 osd.22 up 1.00000 1.00000
23 nvme 0.90239 osd.23 up 1.00000 1.00000
24 ssd 0.86600 osd.24 up 1.00000 1.00000
25 ssd 0.86600 osd.25 up 1.00000 1.00000
26 ssd 0.86600 osd.26 up 1.00000 1.00000
27 ssd 0.86600 osd.27 up 1.00000 1.00000
root@server1:~#
Configure monitor services in ceph2 and ceph3 to have three monitor nodes.
Log in to ceph1 and update the monitor map to include ceph2 and ceph3 as monitor nodes.
FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk '{print $NF}')
monmaptool --add ceph2 10.0.4.2 --fsid $FSID /etc/ceph/monmap
monmaptool --add ceph3 10.0.4.3 --fsid $FSID /etc/ceph/monmap
Update the ceph configuration file to reflect the new monitor nodes.
[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1,10.0.4.2,10.0.4.3
mon initial members = ceph1,ceph2,ceph3

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[mon.ceph2]
host = ceph2
mon addr = 10.0.4.2

[mon.ceph3]
host = ceph3
mon addr = 10.0.4.3

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=1048576
osd_memory_target = 8G
mon_osd_down_out_interval = 180
osd_min_in_ratio = 0.75
osd_op_threads = 12
bluestore_cache_size = 4294967296 # 4 GiB
bluestore_cache_max_dirty = 2147483648
bluestore_cache_deferred_read = true
Copy the generated map and configuration files to other nodes
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph2:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph3:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph4:/etc/ceph/
ssh ceph2 "chown ceph:ceph -R /etc/ceph"
ssh ceph3 "chown ceph:ceph -R /etc/ceph"
ssh ceph4 "chown ceph:ceph -R /etc/ceph"
On each of the new monitor nodes (ceph2 and ceph3)
- Log in (SSH) into the node
- Create a default data directory on the monitor host
- Populate the monitor daemon with the monitor map and keyring
- Enable messenger v2 protocol
export NODENAME=ceph2   # use ceph3 on the third node
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
On each of the manager nodes (ceph1, ceph2, ceph3, ceph4)
- Log in (SSH) into the node
- Create a default data directory on the manager host
- Create an authentication key for the manager daemon
- Enable the daemon to start on host startup
NODENAME=ceph1
mkdir /var/lib/ceph/mgr/ceph-$NODENAME
ceph auth get-or-create mgr.$NODENAME mon 'allow profile mgr' osd 'allow *' mds 'allow *'
ceph auth get-or-create mgr.$NODENAME | tee /etc/ceph/ceph.mgr.admin.keyring
cp /etc/ceph/ceph.mgr.admin.keyring /var/lib/ceph/mgr/ceph-$NODENAME/keyring
chown ceph:ceph /etc/ceph/ceph.mgr.admin.keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-$NODENAME
systemctl enable --now ceph-mgr@$NODENAME
At this stage, the OSD daemons, three monitor daemons, and four manager daemons are up
root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 5m)
    mgr: ceph1(active, since 4m), standbys: ceph2, ceph3, ceph4
    osd: 18 osds: 18 up (since 31m), 18 in (since 31m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 961 KiB
    usage:   1.4 GiB used, 20 TiB / 20 TiB avail
    pgs:     1 active+clean

root@server1:~#
Setting CPU and RAM limits for OSD daemons (Repeat in all servers)
How much to reserve is not known up front; start with values based on availability/affordability and tune later.
Reserving resources for OSD services
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
Create a file /etc/systemd/system/ceph-osd@.service.d/osd.conf with the following contents
[Service]
# cgroup limits
MemoryMax=10240M
#CPUAccounting=true
CPUQuota=400%

# Clear the upstream ExecStart, then launch with the hugepage preload
ExecStart=
ExecStart=/usr/bin/env LD_PRELOAD=/lib/x86_64-linux-gnu/libhugetlbfs.so.0 HUGETLB_MORECORE=yes /usr/bin/ceph-osd --id %i --foreground
Reload and restart the Ceph OSD services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-osd@*'
Reserving resources for monitoring services
sudo mkdir -p /etc/systemd/system/ceph-mon@.service.d
Create a file /etc/systemd/system/ceph-mon@.service.d/monitor.conf with the following contents
[Service]
MemoryMax=8192M
#CPUAccounting=true
CPUQuota=400%
Reload and restart the ceph-mon services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-mon@*'
Setting CPU and RAM limits for Manager daemons
sudo mkdir -p /etc/systemd/system/ceph-mgr@.service.d
Create a file /etc/systemd/system/ceph-mgr@.service.d/monitor.conf with the following contents
[Service]
MemoryMax=2048M
#CPUAccounting=true
CPUQuota=200%
Reload and restart the ceph-mgr services
sudo systemctl daemon-reload
sudo systemctl restart 'ceph-mgr@*'
Create crush rules for logically grouping NVME and SSD OSDs
ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd crush rule create-replicated ssd_rule default host ssd
Configure .mgr pool to use NVME storage
ceph osd pool set .mgr crush_rule nvme_rule
Calculate the placement group count (pg_num) for the NVMe and SSD storage pools. I plan to create the following five pools on NVMe storage, with replication set to 3.
For RBD
nvmepool
For Rados GW (S3 storage for snapshotting etcd when using RKE2 and other S3 needs)
rgw.buckets.data
rgw.buckets.index
rgw.buckets.log
rgw.control
I have 9 NVMe-based OSDs, so pg_num = (9 x 100) / (3 x 5) = 60, rounded off to the nearest power of 2, 64 [3 is the replication factor, 5 is the pool count].
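The same arithmetic can be captured in a small shell function for reuse with other pool counts. A sketch (the function name and the nearest-power-of-two rounding are my additions; the formula is the usual OSDs x 100 / (replicas x pools) rule used above):

```shell
# pg_num = (OSDs * 100) / (replicas * pools), rounded to the nearest
# power of two. Function name is hypothetical.
pg_num() {
  local osds=$1 replicas=$2 pools=$3
  local raw=$(( osds * 100 / (replicas * pools) ))
  local p=1
  while (( p < raw )); do p=$(( p * 2 )); done   # smallest power of two >= raw
  # pick whichever of p and p/2 is closer to raw
  if (( p - raw < raw - p / 2 )); then echo "$p"; else echo $(( p / 2 )); fi
}

pg_num 9 3 5    # 9 NVMe OSDs, 5 pools  -> 64
pg_num 16 3 1   # 16 SSD OSDs, 1 pool   -> 512
```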
Create OSD pools associated with NVME.
# 1) RBD
ceph osd pool create nvmepool 64 64 replicated nvme_rule --size=3
ceph osd pool application enable nvmepool rbd
rbd pool init nvmepool
ceph osd pool set nvmepool pg_autoscale_mode off
ceph osd pool set nvmepool pg_num 64
ceph osd pool set nvmepool pg_num_min 64
ceph osd pool set nvmepool pg_num_max 64

# 2) buckets data (objects)
ceph osd pool create rgw.buckets.data 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.data rgw

# 3) buckets index (metadata)
ceph osd pool create rgw.buckets.index 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.index rgw

# 4) buckets log (optional logging)
ceph osd pool create rgw.buckets.log 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.buckets.log rgw

# 5) control (internal RGW control messages)
ceph osd pool create rgw.control 64 64 replicated nvme_rule --size=3
ceph osd pool application enable rgw.control rgw
# Wait until all placement groups have been created
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
Create OSD pools associated with SSD
I have 16 SSD-based OSDs, so pg_num = (16 x 100) / (3 x 1) = 533, rounded off to the nearest power of 2, 512 [3 is the replication factor, 1 is the pool count].

ceph osd pool create ssdpool 512 512 replicated ssd_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .; sleep 1; done
ceph osd pool application enable ssdpool rbd
rbd pool init ssdpool
ceph osd pool set ssdpool pg_autoscale_mode off
ceph osd pool set ssdpool pg_num 512
ceph osd pool set ssdpool pg_num_min 512
ceph osd pool set ssdpool pg_num_max 512

Optional testing [create a block device and mount it locally]

rbd create --size 10G --pool nvmepool nvmerbd
rbd map nvmerbd --pool nvmepool
# (Change the device name if a different one gets created)
mkfs.ext4 /dev/rbd0
mkdir /root/test
mount /dev/rbd0 /root/test
Enable ceph-mgr-dashboard; create an administrator user account to access the dashboard
Create a text file with the administrator password to be used (in this case I had created dbpass.txt)
ceph mgr module enable dashboard
ceph config set mgr mgr/dashboard/ssl false
ceph dashboard ac-user-create admin -i dbpass.txt administrator
Now access http://ceph1:8080
TODO: CPU Pinning for Ceph-OSD
In all servers
NUMA:
  NUMA node(s):      2
  NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
  NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
Update /etc/systemd/system/ceph-osd@.service.d/osd.conf with the following
[Service]
# cgroup limits
MemoryMax=10240M
#CPUAccounting=true
CPUQuota=400%

# Clear the upstream ExecStart, then launch under numactl with the hugepage preload
ExecStart=
ExecStart=/usr/bin/numactl --membind=0 --physcpubind=0,2,4,6 env LD_PRELOAD=/lib/x86_64-linux-gnu/libhugetlbfs.so.0 HUGETLB_MORECORE=yes /usr/bin/ceph-osd --id %i --foreground
Explanation
- --membind=0 : Allocates memory (hugepages) only from NUMA node 0
- --physcpubind=0,2,4,6 : Pins the OSD to physical CPUs 0, 2, 4 and 6
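On servers with a different CPU layout, the --physcpubind list can be derived from the node's CPU list rather than hard-coded. A minimal sketch (the helper name is mine; it just takes the first N entries of a comma-separated list such as the "NUMA node0 CPU(s)" line above):

```shell
# Hypothetical helper: pick the first N CPUs from a comma-separated list,
# e.g. the "NUMA node0 CPU(s)" output of lscpu.
first_cpus() {
  local list=$1 n=$2
  echo "$list" | tr ',' '\n' | head -n "$n" | paste -sd, -
}

node0="0,2,4,6,8,10,12,14"
first_cpus "$node0" 4    # -> 0,2,4,6
```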