
Ceph + KVM : 1. Planning and preparing for Ceph Storage

Posted on September 22, 2024 (updated June 13, 2025) by sandeep

Next: Installing Ceph

I wanted to ensure that the storage system backing my enterprise solutions is highly scalable, reliable, and cost-effective. It should support block storage, S3, and CSI drivers for Kubernetes.

  • The storage cluster should satisfy the applications’ IOPS requirements.
  • Time-tested CSI driver support for the storage is a must.
  • The CSI driver’s support for over-the-wire encryption will be an added advantage.
  • The hardware requirements should not be high or prohibitive for SMEs planning to opt for on-premise deployments.
  • It should be open-source with a decent number of production deployments.
  • Active community support and decent documentation are required.
    • Initial deployment complexities are not a significant issue, as they are one-time and can be documented.
  • Version upgrades and security patch applications should be possible.
    • It would be an added advantage if they were simple and time-tested.
  • Should support dynamic volume expansion.
  • Although this is a Kubernetes concern, migration of pods and StatefulSets should be automatic and consistent in the event of server or worker node failures.

The decision is to use Ceph RBD and not CephFS. I am new to Ceph, so this is my starting point. All worker node VMs will be backed by block storage on Ceph RBD, and the Ceph CSI driver (RBD) will provide block volumes for PV requirements.

Available hardware: 4 x Dell R630, each with the following configuration

  • Intel X710 daughter board – 4 x 10G SFP+
  • Mellanox ConnectX-3 Pro CX314A – dual 40G QSFP ports – Slot 3 with default bifurcation
  • Dual M.2 NVMe PCIe 3 adapter, populated with 2 NVMe SSDs – Slot 1 with x4x4x4x4 bifurcation
  • Dual M.2 NVMe PCIe 3 adapter, populated with only one NVMe SSD – Slot 2 with default bifurcation
  • 4 x 10K RPM 1.2 TB SAS drives in RAID 10 (PERC H730P Mini) for boot/OS

MikroTik CRS326-24S+2Q+RM

  • Bridge Mode
  • 9000 MTU
    • Ensures maximum utilisation of the available network throughput
    • Consistent 9.86 Gbps in iperf3 test results
  • 20 SFP+ ports connected to servers (5 ports each)
    • Direct Attach Cable
  • 1 SFP+ port connected to TP-Link Router (uplink)
    • Direct Attach Cable

TP-Link ER8411

  • 10G SFP+ LAN Port connected to Cloud Router Switch
  • 10G SFP+ WAN Port connected to Gateway UTM device
    • The UTM device's gateway port is 1G RJ45, so a MikroTik S+RJ10 copper module is used

Arista 7050QX-32S (Switch with 40G ports)

After a factory reset, configure ports 29 to 36 with the following configuration (shown for Ethernet29; repeat for each port, or use the range shortcut shown after the block)

enable
configure terminal
interface Ethernet29
switchport   ! enable L2 mode
switchport mode access ! set as access port
switchport access vlan 1 ! untagged in VLAN 1
mtu 9000 ! jumbo frames up to 9000 bytes
no shutdown ! bring the port up
exit
end
write memory
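
If the EOS release on the switch supports interface ranges, the same settings can be applied to all eight 40G ports in one pass instead of repeating the block per interface. A sketch of that shortcut, with the same settings as above:

enable
configure terminal
interface Ethernet29-36   ! apply to all eight ports in one range
switchport
switchport mode access
switchport access vlan 1
mtu 9000
no shutdown
exit
end
write memory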

Configure the management port – connect it to one of the LAN ports on the TP-Link ER8411 and apply the following configuration

enable

configure terminal

interface Management1
no vrf forwarding management ! ensure it’s in the default VRF
ip address 10.0.0.10/16 ! assign your management IP
no shutdown ! bring Ma1 up
exit

ip routing ! enable the L3 engine
ip route 0.0.0.0/0 10.0.0.1 ! default gateway for off-net traffic

end
write memory
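
To confirm that the management interface and the default route came up as expected, a few standard EOS show commands can be used (the gateway address is the one configured above):

show interfaces Management1
show ip route
ping 10.0.0.1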

Networking

  • 10.0.1.0/24 – Servers/VM Mgmt/Access – 10G Interface
  • 10.0.2.0/24 – Messaging – 10G Interface
  • 10.0.3.0/24 – MariaDB NB access – 10G Interface
  • 10.0.4.0/24 – Ceph public network – 10G Interface
  • 10.0.5.0/24 – Ceph cluster network – 40G Interface
  • 10.0.6.0/24 – MariaDB sync – 40G Interface

    Install Debian 12.11 on the server(s)

    Log in to or SSH into the server using the user account configured during installation, then switch to the root user account:

    su -

    Enable remote root login and key-based authentication for SSH, disable strict host-key checking on the client side, and silence the MOTD on login:

    sed -i "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/g" /etc/ssh/sshd_config
    sed -i "s/#PubkeyAuthentication/PubkeyAuthentication/g" /etc/ssh/sshd_config
    sed -i "s/#AuthorizedKeysFile/AuthorizedKeysFile/g" /etc/ssh/sshd_config
    sed -i "s/# StrictHostKeyChecking ask/ StrictHostKeyChecking no/g" /etc/ssh/ssh_config
    sed -i "s/session optional pam_motd.so/#session optional pam_motd.so/g" /etc/pam.d/sshd
    service ssh restart

    Remove CDROM from the apt sources list.

    sed -i '/deb cdrom/d' /etc/apt/sources.list

    Log out and log in as the root user.

    On all servers, a RAID 10 virtual disk, backed by the Dell PERC H730P Mini RAID controller, is used for booting. I observed that the name of this device changed on every reboot, sometimes /dev/sde and sometimes /dev/sdb, which in turn shifted the names of the other SSDs in the system. Once the Ceph OSD LVM volumes are created this no longer matters, but I wanted the NVMe and SATA SSDs used for Ceph OSDs to have fixed names, so I used udevadm to map each device's serial number to a stable symlink such as /dev/cephdisk0. All that is required is to get the serial number and add a matching rule to /etc/udev/rules.d/99-local-disks.rules.

    root@server4:~# udevadm info --query=all --name=/dev/sde
    P: /devices/pci0000:00/0000:00:01.0/0000:02:00.0/host0/target0:0:5/0:0:5:0/block/sde
    M: sde
    U: block
    T: disk
    D: b 8:64
    N: sde
    L: 0
    S: disk/by-id/ata-SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824
    ...
    ...
    E: ID_MODEL=SAMSUNG_MZ7LH960HAJR-00005

    E: ID_MODEL_ENC=SAMSUNG\x20MZ7LH960HAJR-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
    E: ID_REVISION=HXT7404Q
    E: ID_SERIAL=SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824
    E: ID_SERIAL_SHORT=S45NNA0N633824
    E: ID_ATA_WRITE_CACHE=1
    ...
    ...
    E: DEVLINKS=/dev/disk/by-id/ata-SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824 /dev/disk/by-diskseq/8 /dev/disk/by-id/wwn-0x5002538e106178bb /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:5:0

    E: TAGS=:systemd:
    E: CURRENT_TAGS=:systemd:

    Add an entry in /etc/udev/rules.d/99-local-disks.rules.  Sample contents from one server

    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="24127Y4A1S05", SYMLINK+="cephdisk0"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614451P", SYMLINK+="cephdisk1"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614347H", SYMLINK+="cephdisk2"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA1N432494", SYMLINK+="cephdisk3"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA1N403233", SYMLINK+="cephdisk4"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA0N498554", SYMLINK+="cephdisk5"
    ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA0N633824", SYMLINK+="cephdisk6"

    Install required packages

    apt -y install net-tools rsyslog systemd-resolved bc fio iperf3 gnupg2 software-properties-common lvm2 nfs-common jq

    Configure the DNS server IP – let systemd-resolved manage the DNS configuration. (Note: 10.0.0.1 is the IP of the UTM device, which also provides DNS services.)

    ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf
    sed -i "s/#DNS=/DNS=10.0.0.1/g" /etc/systemd/resolved.conf
    sed -i "s/#Domains=/Domains=<domain.net>/g" /etc/systemd/resolved.conf
    systemctl daemon-reload

    systemctl restart systemd-resolved
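
    A quick check that systemd-resolved picked up the new settings; the second command assumes the UTM's DNS already has records for the ceph hosts:

    resolvectl status
    resolvectl query ceph1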

    Disable the daily update timers and AppArmor.

    systemctl stop apt-daily-upgrade.timer apt-daily.timer apparmor
    systemctl disable apt-daily-upgrade.timer apt-daily.timer apparmor

    Set the timezone, configure the NTP servers for systemd-timesyncd, and restart the service.

    timedatectl set-timezone "Asia/Kolkata"
    sed -i "s/#NTP=/NTP=time\.google\.com/g" /etc/systemd/timesyncd.conf
    sed -i "s/#FallbackNTP=ntp.ubuntu.com/FallbackNTP=ntp\.ubuntu\.com/g" /etc/systemd/timesyncd.conf
    systemctl daemon-reload

    systemctl stop systemd-timesyncd.service
    systemctl start systemd-timesyncd.service
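
    Verify that time synchronisation is active (the second command may not be available on older systemd versions):

    timedatectl status
    timedatectl timesync-status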

    Configure the max file size of the journal

    sed -i "s/#SystemMaxFileSize.*/SystemMaxFileSize=512M/g" /etc/systemd/journald.conf

    Configure the maximum number of open files and the maximum number of processes.

    echo "* hard nofile 65536" >> /etc/security/limits.conf
    echo "* soft nofile 65536" >> /etc/security/limits.conf
    echo "* hard nproc 65536" >> /etc/security/limits.conf
    echo "* soft nproc 65536" >> /etc/security/limits.conf

    Disable IPv6 and enable huge pages.

    sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="ipv6.disable=1 default_hugepagesz=1G hugepagesz=1G hugepages=64 transparent_hugepage=never "/g' /etc/default/grub

    Since NVMe disks will be added as OSDs on a bare-metal Ceph installation, disabling the IOMMU is recommended [https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/].

    sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off iommu=off amd_iommu=off"/g' /etc/default/grub
    update-grub
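
    The new kernel parameters only apply after a reboot; once the server is back up, they can be confirmed from the running kernel's command line:

    cat /proc/cmdline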

    Pending TODO: add 'isolcpus=0,2,4,6' to reserve those CPUs for the OSD services – create a separate slice and dedicate cores 0,2,4,6 to the OSDs?

    Configure the network interfaces – update /etc/network/interfaces (the sample below is from server1; change it as required).

    source /etc/network/interfaces.d/*

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # The primary network interface
    allow-hotplug eno1
    iface eno1 inet static
    address 10.0.1.1/16
    gateway 10.0.0.1
    mtu 9000

    allow-hotplug eno2
    iface eno2 inet static
    address 10.0.2.1/24
    mtu 9000

    allow-hotplug eno3
    iface eno3 inet static
    address 10.0.3.1/24
    mtu 9000

    allow-hotplug eno4
    iface eno4 inet static
    address 10.0.4.1/24
    mtu 9000

    allow-hotplug enp3s0
    iface enp3s0 inet static
    address 10.0.5.1/24
    mtu 9000

    allow-hotplug enp3s0d1
    iface enp3s0d1 inet static
    address 10.0.6.1/24
    mtu 9000

    Reboot the server
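
    After the reboot, a quick sanity check that every interface has the expected address and the 9000-byte MTU:

    ip -br addr show
    ip link show enp3s0 | grep -o "mtu [0-9]*"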

    Enable key-based, passwordless SSH between servers

    echo "Host ceph1" > ~/.ssh/config
    echo " Hostname ceph1" >> ~/.ssh/config
    echo " User root" >> ~/.ssh/config
    echo "Host ceph2" >> ~/.ssh/config
    echo " Hostname ceph2" >> ~/.ssh/config
    echo " User root" >> ~/.ssh/config
    echo "Host ceph3" >> ~/.ssh/config
    echo " Hostname ceph3" >> ~/.ssh/config
    echo " User root" >> ~/.ssh/config
    echo "Host ceph4" >> ~/.ssh/config
    echo " Hostname ceph4" >> ~/.ssh/config
    echo " User root" >> ~/.ssh/config

    ssh-keygen -q -N ""

    ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph1'

    ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph2'
    ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph3'
    ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph4'
    ssh-copy-id ceph1
    ssh-copy-id ceph2
    ssh-copy-id ceph3
    ssh-copy-id ceph4
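
    A short loop confirms that passwordless root SSH works to every node (run from each server in turn; it assumes the ceph1–ceph4 hostnames resolve via DNS or /etc/hosts):

    for h in ceph1 ceph2 ceph3 ceph4; do ssh "$h" hostname; done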

    Tune NIC throughput – add the following at the end of /etc/sysctl.conf

    # Increase socket buffer sizes for high-speed TCP
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 87380 67108864

    # Disable reverse-path filtering (rp_filter) globally and on our 40G NICs
    net.ipv4.conf.default.rp_filter = 0
    net.ipv4.conf.all.rp_filter = 0
    net.ipv4.conf.enp3s0.rp_filter = 0
    net.ipv4.conf.enp3s0d1.rp_filter = 0

    net.core.netdev_max_backlog = 250000
    net.core.somaxconn = 16384
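
    The new values can be applied without a reboot and spot-checked before running any throughput tests:

    sysctl -p
    sysctl net.core.rmem_max net.ipv4.tcp_rmem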

    Basic iperf3 check, with server2 as the iperf3 client and server1 as the iperf3 server

    40G Interface test

    root@server2:~# iperf3 -c 10.0.5.1
    Connecting to host 10.0.5.1, port 5201
    [ 5] local 10.0.5.2 port 44328 connected to 10.0.5.1 port 5201
    [ ID] Interval Transfer Bitrate Retr Cwnd
    [ 5] 0.00-1.00 sec 4.20 GBytes 36.0 Gbits/sec 0 1.93 MBytes
    [ 5] 1.00-2.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.04 MBytes
    [ 5] 2.00-3.00 sec 4.20 GBytes 36.1 Gbits/sec 0 2.04 MBytes
    [ 5] 3.00-4.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
    [ 5] 4.00-5.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
    [ 5] 5.00-6.00 sec 4.21 GBytes 36.1 Gbits/sec 0 2.82 MBytes
    [ 5] 6.00-7.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
    [ 5] 7.00-8.00 sec 4.22 GBytes 36.2 Gbits/sec 0 2.82 MBytes
    [ 5] 8.00-9.00 sec 4.22 GBytes 36.2 Gbits/sec 0 2.82 MBytes
    [ 5] 9.00-10.00 sec 4.22 GBytes 36.2 Gbits/sec 4 2.53 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bitrate Retr
    [ 5] 0.00-10.00 sec 42.1 GBytes 36.2 Gbits/sec 4 sender
    [ 5] 0.00-10.00 sec 42.1 GBytes 36.2 Gbits/sec receiver

    iperf Done.
    root@server2:~#

    10G Interface test

    root@server2:~# iperf3 -c 10.0.2.1
    Connecting to host 10.0.2.1, port 5201
    [ 5] local 10.0.2.2 port 33756 connected to 10.0.2.1 port 5201
    [ ID] Interval Transfer Bitrate Retr Cwnd
    [ 5] 0.00-1.00 sec 1.16 GBytes 9.92 Gbits/sec 83 1.51 MBytes
    [ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
    [ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
    [ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
    [ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
    [ 5] 5.00-6.00 sec 1.15 GBytes 9.91 Gbits/sec 0 1.54 MBytes
    [ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.60 MBytes
    [ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.66 MBytes
    [ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.72 MBytes
    [ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.72 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval Transfer Bitrate Retr
    [ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 83 sender
    [ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver

    iperf Done.
    root@server2:~#

    Enable THP in madvise mode

    cat <<EOF | sudo tee /etc/systemd/system/enable-thp.service
    [Unit]
    Description=Enable Transparent Huge Pages in madvise mode
    After=multi-user.target

    [Service]
    Type=oneshot
    ExecStart=/bin/bash -c 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'

    [Install]
    WantedBy=multi-user.target
    EOF

    sudo systemctl daemon-reexec
    sudo systemctl daemon-reload
    sudo systemctl enable --now enable-thp.service
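
    Confirm that THP is now in madvise mode (the active mode is shown in square brackets):

    cat /sys/kernel/mm/transparent_hugepage/enabled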

    Huge pages verification

    root@server1:~# cat /proc/meminfo | grep "HugePages"

    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    FileHugePages: 0 kB
    HugePages_Total: 32000
    HugePages_Free: 32000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    root@server1:~#

    Some notes on selecting NVMe drives

    • Check for PLP (power loss protection)
    • Prefer TLC over QLC
    • TBW – the higher, the better
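
    There is no single universal flag that reports PLP; that usually has to come from the drive's datasheet. Endurance counters, however, can be read with nvme-cli and smartmontools. A small sketch, assuming the tools are installed and the drive of interest is /dev/nvme0:

    apt -y install nvme-cli smartmontools
    # endurance consumed so far and total data written
    nvme smart-log /dev/nvme0 | grep -Ei "percentage_used|data_units_written"
    # identify the exact drive model for datasheet lookup
    smartctl -a /dev/nvme0 | grep -i "model"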
