
Ceph + KVM : 2. Installation – Ceph Storage

Posted on September 22, 2024 (updated April 7, 2025) by sandeep

Previous: Planning / Preparing servers | Next: Installing KVM

The plan is to use 10.0.4.0/24 for the public network and 10.0.5.0/24 for the cluster network.

As part of planning and preparing the servers, /etc/hosts was updated on all servers:

10.0.4.1 ceph1
10.0.4.2 ceph2
10.0.4.3 ceph3
10.0.4.4 ceph4
10.0.5.1 ceph-private1
10.0.5.2 ceph-private2
10.0.5.3 ceph-private3
10.0.5.4 ceph-private4

As part of planning and preparing the servers, we also enabled passwordless, key-based SSH access between servers, which is a prerequisite for this Ceph installation.
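A quick way to confirm that prerequisite before starting, assuming the keys were set up for root as in the previous part (the loop below is just a sketch):

# Should print each hostname without ever prompting for a password
for h in ceph1 ceph2 ceph3 ceph4; do ssh -o BatchMode=yes root@$h hostname; done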

Add the Ceph (Reef) repository on all servers

wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-add-repository 'deb https://download.ceph.com/debian-reef/ bookworm main'
apt -y update
apt -y upgrade

Install Ceph on all servers, remove cephadm (not used in this setup), and install numactl (for CPU pinning of the OSD services) along with the hugetlbfs packages (for huge-page support)

apt -y install ceph python3-packaging numactl libhugetlbfs-bin libhugetlbfs0
apt remove --purge cephadm
apt reinstall python3-cryptography
apt reinstall ceph-mgr
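To confirm the packages actually came from the Reef repository, a quick check on each node:

ceph --version   # should report an 18.2.x (reef) build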

Log in (ssh) to node ceph1 (the first node)

Generate a unique UUID to be used as the cluster fsid.

uuidgen

b115cfad-cce9-4404-a9eb-e821e856bbfd

Create the Ceph configuration file /etc/ceph/ceph.conf (with only one monitor node to start with)

[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1
mon initial members = ceph1

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
osd_memory_target = 14G
mon_osd_down_out_interval = 180
mon_osd_min_in_ratio = 0.75
osd_op_threads = 6

Notes

  • Huge pages (2 MB size) were enabled while preparing the servers, so adding "bluestore_rocksdb_options = memtable_huge_page_size=2048" lets the OSDs use them, improving performance (a quick check follows after this list).
  • We plan to reserve 16G per OSD via systemd, which leaves enough headroom for the 14G osd_memory_target.
  • If an OSD stays down for 3 minutes (mon_osd_down_out_interval = 180), it is marked out and its PGs are rebalanced onto the remaining OSDs.
  • mon_osd_min_in_ratio = 0.75 stops Ceph from automatically marking out further OSDs once fewer than 75% remain in, protecting data consistency.
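A quick check that the 2 MB huge pages reserved earlier are actually available (the expected counts depend on how many pages were reserved while preparing the servers):

grep -E 'HugePages_Total|HugePages_Free|Hugepagesize' /proc/meminfo
# Hugepagesize: 2048 kB confirms the 2M page size assumed by the rocksdb option above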

Create a keyring for the cluster and generate a monitor secret key.

ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'

Generate an administrator keyring, generate a client.admin user and add the user to the keyring

ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'

Generate a bootstrap-osd keyring, generate a client.bootstrap-osd user and add the user to the keyring

ceph-authtool --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'

Import generated keys

ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring

Generate monitor map

FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk {'print $NF'})
monmaptool --create --add ceph1 10.0.4.1 --fsid $FSID /etc/ceph/monmap
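Optionally verify the map before distributing it:

monmaptool --print /etc/ceph/monmap   # should show the fsid and mon.ceph1 at 10.0.4.1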

Copy the generated map and configuration files to other nodes
scp /etc/ceph/* ceph2:/etc/ceph/
scp /etc/ceph/* ceph3:/etc/ceph/
scp /etc/ceph/* ceph4:/etc/ceph/
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph2:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph3:/var/lib/ceph/bootstrap-osd
scp /var/lib/ceph/bootstrap-osd/ceph.keyring ceph4:/var/lib/ceph/bootstrap-osd
ssh ceph2 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph3 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"
ssh ceph4 "chown ceph:ceph -R /etc/ceph /var/lib/ceph/bootstrap-osd/*"

Configure and enable monitor daemon

  • Create a default data directory on the monitor host
  • Populate the monitor daemon with the monitor map and keyring
  • Enable messenger v2 protocol
export NODENAME=ceph1
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
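The single monitor should now be up and in quorum on its own; a quick sanity check before moving on:

ceph mon stat                 # expect a quorum containing only ceph1
ceph mon dump | grep ceph1    # should list both v2 (3300) and v1 (6789) addresses once msgr2 is enabled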

Create ceph volumes in all servers

Used the following snippet to create Ceph volumes on the NVMe disks (adjust the devs list to match the NVMe devices present in each server)

export devs="nvme0 nvme1 nvme2"
for DEVNAME in $devs; do
  if lsblk | grep "${DEVNAME}n1p1"; then
    echo "Clearing /dev/${DEVNAME}n1"
    wipefs -a /dev/${DEVNAME}n1 
  else
    echo "No partitions found on /dev/${DEVNAME}n1"
  fi
  parted /dev/${DEVNAME}n1 mklabel gpt
  parted -s /dev/${DEVNAME}n1 "mkpart primary 0% 100%"
  wipefs -a /dev/${DEVNAME}n1p1
  ceph-volume lvm create --data /dev/${DEVNAME}n1p1 --crush-device-class nvme
done

Used the following snippet to create Ceph volumes on the SSD disks (again, adjust the devs list per server)

export devs="sda sde"
for DEVNAME in $devs; do
  if lsblk | grep "${DEVNAME}1"; then
    echo "Clearing /dev/${DEVNAME}"
    wipefs -a /dev/${DEVNAME} 
  else
    echo "No partitions found on /dev/${DEVNAME}"
  fi
  parted /dev/${DEVNAME} mklabel gpt
  parted -s /dev/${DEVNAME} "mkpart primary 0% 100%"
  wipefs -a /dev/${DEVNAME}1
  ceph-volume lvm create --data /dev/${DEVNAME}1 --crush-device-class ssd
done
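To see which OSD ID ended up on which device (and with which device class) on a node:

ceph-volume lvm list    # prints each osd.N with its backing device and crush device class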

After creating all ceph volumes, check the status

root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 1 daemons, quorum ceph1 (age 35m)
    mgr: no daemons active
    osd: 18 osds: 18 up (since 22s), 18 in (since 34s)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

root@server1:~# ceph osd tree

ID CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
-1       19.79449 root default
-3        5.38527 host server1
0  nvme   1.81929 osd.0         up 1.00000  1.00000
1  nvme   0.90970 osd.1         up 1.00000  1.00000
2  nvme   0.90970 osd.2         up 1.00000  1.00000
3  ssd    0.87329 osd.3         up 1.00000  1.00000
4  ssd    0.87329 osd.4         up 1.00000  1.00000
-7        5.38527 host server2
5  nvme   1.81929 osd.5         up 1.00000  1.00000
6  nvme   0.90970 osd.6         up 1.00000  1.00000
7  nvme   0.90970 osd.7         up 1.00000  1.00000
8  ssd    0.87329 osd.8         up 1.00000  1.00000
9  ssd    0.87329 osd.9         up 1.00000  1.00000
-10       3.63869 host server3
10 nvme   1.81929 osd.10        up 1.00000  1.00000
11 nvme   0.90970 osd.11        up 1.00000  1.00000
12 nvme   0.90970 osd.12        up 1.00000  1.00000
-13       5.38527 host server4
13 nvme   1.81929 osd.13        up 1.00000  1.00000
14 nvme   0.90970 osd.14        up 1.00000  1.00000
15 nvme   0.90970 osd.15        up 1.00000  1.00000
16 ssd    0.87329 osd.16        up 1.00000  1.00000
17 ssd    0.87329 osd.17        up 1.00000  1.00000
root@server1:~#

Configure monitor services in ceph2 and ceph3 to have three monitor nodes.

Log in (ssh) to ceph1 and update the monitor map to include ceph2 and ceph3 as monitor nodes.

FSID=$(grep "^fsid" /etc/ceph/ceph.conf | awk {'print $NF'})
monmaptool --add ceph2 10.0.4.2 --fsid $FSID /etc/ceph/monmap
monmaptool --add ceph3 10.0.4.3 --fsid $FSID /etc/ceph/monmap

Update the ceph configuration file to reflect the new monitor nodes.

[global]
cluster_network = 10.0.5.0/24
public_network = 10.0.4.0/24
fsid = 577c09c2-c514-471a-aee1-6a0f56c83c3a
mon host = 10.0.4.1,10.0.4.2,10.0.4.3
mon initial members = ceph1,ceph2,ceph3

[mon]
mon allow pool delete = true
mon_max_pg_per_osd = 3072

[mon.ceph1]
host = ceph1
mon addr = 10.0.4.1

[mon.ceph2]
host = ceph2
mon addr = 10.0.4.2

[mon.ceph3]
host = ceph3
mon addr = 10.0.4.3

[osd]
osd crush update on start = true
bluestore_rocksdb_options = memtable_huge_page_size=2048
osd_memory_target = 14G
mon_osd_down_out_interval = 180
mon_osd_min_in_ratio = 0.75
osd_op_threads = 6

Copy the generated map and configuration files to other nodes

scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph2:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph3:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/monmap ceph4:/etc/ceph/
ssh ceph2 "chown ceph:ceph -R /etc/ceph"
ssh ceph3 "chown ceph:ceph -R /etc/ceph"
ssh ceph4 "chown ceph:ceph -R /etc/ceph"

On each of the new monitor nodes (ceph2 and ceph3)

  • Log in (ssh) to the node
  • Create a default data directory on the monitor host
  • Populate the monitor daemon with the monitor map and keyring
  • Enable messenger v2 protocol
export NODENAME=ceph2   # set to ceph3 when repeating on the third monitor node
mkdir /var/lib/ceph/mon/ceph-$NODENAME
ceph-mon --cluster ceph --mkfs -i $NODENAME --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
chown ceph:ceph /etc/ceph/ceph.*
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$NODENAME /var/lib/ceph/bootstrap-osd
systemctl enable --now ceph-mon@$NODENAME
ceph mon enable-msgr2
ceph config set mon auth_allow_insecure_global_id_reclaim false
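Once both new monitors are running, all three should be in quorum:

ceph mon stat    # expect quorum ceph1,ceph2,ceph3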

On each of the manager nodes (ceph1, ceph2, ceph3, ceph4)

  • Log in (ssh) to the node
  • Create a default data directory on the manager host
  • Create an authentication key for the manager daemon
  • Enable the daemon to start on host startup
NODENAME=ceph1   # repeat with ceph2, ceph3 and ceph4 on the other manager nodes
mkdir /var/lib/ceph/mgr/ceph-$NODENAME
ceph auth get-or-create mgr.$NODENAME mon 'allow profile mgr' osd 'allow *' mds 'allow *'
ceph auth get-or-create mgr.$NODENAME | tee /etc/ceph/ceph.mgr.admin.keyring
cp /etc/ceph/ceph.mgr.admin.keyring /var/lib/ceph/mgr/ceph-$NODENAME/keyring
chown ceph:ceph /etc/ceph/ceph.mgr.admin.keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-$NODENAME
systemctl enable --now ceph-mgr@$NODENAME

At this stage, all 18 OSD daemons, three monitor daemons and four manager daemons (one active, three standby) are up

root@server1:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 5m)
    mgr: ceph1(active, since 4m), standbys: ceph2, ceph3, ceph4
    osd: 18 osds: 18 up (since 31m), 18 in (since 31m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 961 KiB
    usage:   1.4 GiB used, 20 TiB / 20 TiB avail
    pgs:     1 active+clean

root@server1:~#

Setting CPU and RAM limits for OSD daemons (Repeat in all servers)

How much to reserve up front is hard to know; start with values based on what the hardware can afford and adjust later.

Reserving resources:

systemctl edit ceph-osd@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=16384M
CPUQuota=400%
### Edits below this comment will be discarded

Setting CPU and RAM limits for Monitor daemons 

systemctl edit ceph-mon@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=8192M
CPUQuota=400%
### Edits below this comment will be discarded

Setting CPU and RAM limits for Manager daemons

systemctl edit ceph-mgr@.service
### Anything between here and the comment below will become the contents of the drop-in file
[Service]
MemoryMax=2048M
CPUQuota=200%
### Edits below this comment will be discarded
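The drop-ins only take effect after the units are restarted. A rough way to apply and verify them on one node (restarting OSDs one at a time so placement groups stay available; the 30-second pause is an arbitrary choice):

systemctl daemon-reload
systemctl restart ceph-mon.target ceph-mgr.target
for i in $(ls /var/lib/ceph/osd/ | sed 's/ceph-//'); do
  systemctl restart ceph-osd@$i
  sleep 30   # let the OSD rejoin before restarting the next one
done
systemctl show ceph-osd@0 | grep -E 'MemoryMax|CPUQuota'   # confirm the limits were picked up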

Create CRUSH rules for logically grouping the NVMe and SSD OSDs

ceph osd crush rule create-replicated nvme_rule default host nvme
ceph osd crush rule create-replicated ssd_rule default host ssd

Configure .mgr pool to use NVME storage

ceph osd pool set .mgr crush_rule nvme_rule
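To confirm the rules exist and that the .mgr pool picked up the new rule:

ceph osd crush rule ls              # should now include nvme_rule and ssd_rule
ceph osd pool get .mgr crush_rule   # should return: crush_rule: nvme_rule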

Create RBD pools and initialize them

Create two pools, one with nvme_rule and one with ssd_rule, and both with a replication factor of three (default even if not configured)

ceph osd pool create ssdpool 512 512 replicated ssd_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ]; do echo -n .;sleep 1; done
ceph osd pool application enable ssdpool rbd
rbd pool init ssdpool

ceph osd pool create nvmepool 1024 1024 replicated nvme_rule --size=3
while [ $(ceph -s | grep creating -c) -gt 0 ];do echo -n .;sleep 1; done 
ceph osd pool application enable nvmepool rbd
rbd pool init nvmepool
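A quick check that both pools came up with the intended size and rule:

ceph osd pool ls detail | grep -E 'ssdpool|nvmepool'
# each line should show "replicated size 3" and the crush_rule id of the matching rule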

Create a block device and mount it locally.


rbd create --size 10G --pool nvmepool nvmebd
rbd map nvmebd --pool nvmepool
mkfs.ext4 /dev/rbd0   # change the device name if a different one gets created
mkdir /root/test
mount /dev/rbd0 /root/test
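To confirm the mapping and the mount:

rbd showmapped     # lists pool, image and the /dev/rbdX device it is mapped to
df -h /root/test   # the new filesystem should show roughly 10G

Note that a plain rbd map does not persist across reboots; the rbdmap service (with /etc/ceph/rbdmap) can be used to remap images at boot if needed.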

Enable ceph-mgr-dashboard; create an administrator user account to access the dashboard

Create a text file containing the administrator password to be used (in this case, dbpass.txt)

ceph mgr module enable dashboard
ceph config set mgr mgr/dashboard/ssl false
ceph dashboard ac-user-create admin -i dbpass.txt administrator

Now access http://ceph1:8080
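If the dashboard does not answer there, it is served by whichever mgr is currently active; the exact URL can be read from the mgr itself:

ceph mgr services    # e.g. {"dashboard": "http://ceph1:8080/"}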

CPU Pinning for Ceph-OSD

On all servers, the NUMA topology looks like this (from lscpu):

NUMA:
  NUMA node(s): 2
  NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
  NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71

Override the unit configuration for each ceph-osd instance. For example, run "systemctl edit ceph-osd@0.service" and configure the following:

[Service]
ExecStart=
ExecStart=/usr/bin/numactl --membind=0 --physcpubind=0,2,4,6 env LD_PRELOAD=/lib/x86_64-linux-gnu/libhugetlbfs.so.0 HUGETLB_MORECORE=yes /usr/bin/ceph-osd --id %i --foreground

Explanation

  • --membind=0: Allocates memory (including the hugepages) only from NUMA node 0

  • --physcpubind=0,2,4,6: Pins the OSD to physical CPUs 0, 2, 4 and 6, which all belong to NUMA node 0 (matching the membind)
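After editing a unit, the override only applies once that OSD is restarted. A quick way to verify the pinning for osd.0 (the pgrep pattern is an assumption about how the process appears in the process list):

systemctl daemon-reload
systemctl restart ceph-osd@0
taskset -cp $(pgrep -f 'ceph-osd --id 0' | head -n1)    # affinity list should be 0,2,4,6
numastat -p $(pgrep -f 'ceph-osd --id 0' | head -n1)    # memory should be allocated on node 0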

 
