I wanted to ensure the storage system backing my enterprise solutions is highly scalable, reliable, and cost-effective. It should support block storage, S3-compatible object storage, and CSI drivers for Kubernetes.
- The storage cluster should satisfy the applications’ IOPS requirements.
- Time-tested CSI driver support for the storage is a must.
- The CSI driver’s support for over-the-wire encryption will be an added advantage.
- The hardware requirements should not be high or prohibitive for SMEs planning to opt for on-premise deployments.
- It should be open-source with a decent number of production deployments.
- Active community support and decent documentation are required.
- Initial deployment complexities are not a significant issue, as they are one-time and can be documented.
- Version upgrades and security patch applications should be possible.
- It would be an added advantage if they were simple and time-tested.
- It should support dynamic volume expansion.
- Although this is a Kubernetes feature, migration of pods and StatefulSets should be automatic and consistent in the event of server or worker node failures.
The decision is to use Ceph RBD and not CephFS. I am new to Ceph and am starting with this decision. All worker node VMs will be backed by block storage in Ceph RBD, and the Ceph CSI driver (RBD) will provide block volumes for PersistentVolume requirements.
Available hardware: 4 x Dell R630, each with the following configuration
- Intel X710 daughter board – 4 x 10G SFP+
- Mellanox ConnectX-3 Pro CX314A – dual 40G QSFP ports – Slot 3 with default bifurcation
- Dual M.2 NVMe PCIe 3 adapter populated with 2 NVMe SSDs – Slot 1 with x4x4x4x4 bifurcation
- Dual M.2 NVMe PCIe 3 adapter populated with only one NVMe SSD – Slot 2 with default bifurcation
- 4 x 10K RPM 1.2 TB SAS drives in RAID 10 (PERC H730P Mini) for boot/OS
MikroTik CRS326-24S+2Q+RM
- Bridge Mode
- 9000 MTU
- Ensures maximum utilisation of the available network throughput
- Consistent 9.86 Gbps in iperf3 test results
- 20 SFP+ ports connected to servers (5 ports each)
- Direct Attach Cable
- 1 SFP+ port connected to TP-Link Router (uplink)
- Direct Attach Cable
TP-Link ER8411
- 10G SFP+ LAN Port connected to Cloud Router Switch
- 10G SFP+ WAN Port connected to Gateway UTM device
- The UTM device gateway port is 1G RJ45, so a MikroTik S+RJ10 copper module is used
Arista 7050QX-32S (Switch with 40G ports)
After a factory reset, configure ports 29 to 36 with the following configuration (shown for Ethernet29 – repeat for each port):
enable
configure terminal
interface Ethernet29
switchport ! enable L2 mode
switchport mode access ! set as access port
switchport access vlan 1 ! untagged in VLAN 1
mtu 9000 ! jumbo frames up to 9000 bytes
no shutdown ! bring the port up
exit
end
write memory
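A quick sanity check after applying the port configuration, using standard EOS show commands:
show interfaces status
show vlan 1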
Configure the management port – connect Management1 to one of the LAN ports on the TP-Link ER8411 and apply the following configuration:
enable
configure terminal
interface Management1
no vrf forwarding management ! ensure it’s in the default VRF
ip address 10.0.0.10/16 ! assign your management IP
no shutdown ! bring Ma1 up
exit
ip routing ! enable the L3 engine
ip route 0.0.0.0/0 10.0.0.1 ! default gateway for off-net traffic
end
write memory
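To verify management connectivity from the switch:
show ip interface brief
show ip route
ping 10.0.0.1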
Networking
- 10.0.1.0/24 – Servers/VM Mgmt/Access – 10G Interface
- 10.0.2.0/24 – Messaging – 10G Interface
- 10.0.3.0/24 – MariaDB NB access – 10G Interface
- 10.0.4.0/24 – Ceph public network – 10G Interface
- 10.0.5.0/24 – Ceph cluster network – 40G Interface
- 10.0.6.0/24 – MariaDB sync – 40G Interface
Install Debian 12.11 on the server(s)
Log in to or SSH into the server using the user account configured during installation.
Switch to the root user account
su -
Enable remote root login with key-based authentication, disable strict host-key checking, and silence the MOTD
sed -i "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/g" /etc/ssh/sshd_config
sed -i "s/#PubkeyAuthentication/PubkeyAuthentication/g" /etc/ssh/sshd_config
sed -i "s/#AuthorizedKeysFile/AuthorizedKeysFile/g" /etc/ssh/sshd_config
sed -i "s/# StrictHostKeyChecking ask/ StrictHostKeyChecking no/g" /etc/ssh/ssh_config
sed -i "s/session optional pam_motd.so/#session optional pam_motd.so/g" /etc/pam.d/sshd
sed -i "s/session optional pam_motd.so/#session optional pam_motd.so/g" /etc/pam.d/sshd
service ssh restart
Remove CDROM from the apt sources list.
sed -i '/deb cdrom/d' /etc/apt/sources.list
Log out and log in as the root user.
On all servers, a RAID 10-based virtual disk, backed by the Dell PERC H730P Mini RAID controller, is used for booting. I observed that the name of this device would change on every reboot, sometimes to /dev/sde and sometimes to /dev/sdb, which in turn changed the names of the other SSDs on the system. Once the Ceph OSD LVM volume is created this no longer matters, but I wanted the NVMe drives and SSDs used for Ceph OSDs to have fixed names, so I opted to use udevadm: each device's serial number is mapped to a specific symlink name, such as /dev/cephdisk0. All that is required is to get the serial number and update /etc/udev/rules.d/99-local-disks.rules.
root@server4:~# udevadm info --query=all --name=/dev/sde
P: /devices/pci0000:00/0000:00:01.0/0000:02:00.0/host0/target0:0:5/0:0:5:0/block/sde
M: sde
U: block
T: disk
D: b 8:64
N: sde
L: 0
S: disk/by-id/ata-SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824
...
...
E: ID_MODEL=SAMSUNG_MZ7LH960HAJR-00005
E: ID_MODEL_ENC=SAMSUNG\x20MZ7LH960HAJR-00005\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_REVISION=HXT7404Q
E: ID_SERIAL=SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824
E: ID_SERIAL_SHORT=S45NNA0N633824
E: ID_ATA_WRITE_CACHE=1
...
...
E: DEVLINKS=/dev/disk/by-id/ata-SAMSUNG_MZ7LH960HAJR-00005_S45NNA0N633824 /dev/disk/by-diskseq/8 /dev/disk/by-id/wwn-0x5002538e106178bb /dev/disk/by-path/pci-0000:02:00.0-scsi-0:0:5:0
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:
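To collect the short serial for every disk in one pass, a small loop like the following works (the device globs are an assumption – adjust them to match your system):
for d in /dev/sd? /dev/nvme?n?; do
  echo "$d: $(udevadm info --query=property --name="$d" | grep '^ID_SERIAL_SHORT=')"
done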
Add an entry in /etc/udev/rules.d/99-local-disks.rules. Sample contents from one server
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="24127Y4A1S05", SYMLINK+="cephdisk0"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614451P", SYMLINK+="cephdisk1"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614347H", SYMLINK+="cephdisk2"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA1N432494", SYMLINK+="cephdisk3"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA1N403233", SYMLINK+="cephdisk4"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA0N498554", SYMLINK+="cephdisk5"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S45NNA0N633824", SYMLINK+="cephdisk6"
Install required packages
apt -y install net-tools rsyslog systemd-resolved bc fio iperf3 gnupg2 software-properties-common lvm2 nfs-common jq
Configure the DNS server IP – let systemd-resolved manage the DNS configuration. (Note: 10.0.0.1 is the IP of the UTM device, which also provides DNS services.)
ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf
sed -i "s/#DNS=/DNS=10.0.0.1/g" /etc/systemd/resolved.conf
sed -i "s/#Domains=/Domains=<domain.net>/g" /etc/systemd/resolved.conf
systemctl daemon-reload
systemctl restart systemd-resolved
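Confirm the DNS server and search domain were picked up:
resolvectl status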
Disable the daily APT update timers and AppArmor.
systemctl stop apt-daily-upgrade.timer apt-daily.timer apparmor
systemctl disable apt-daily-upgrade.timer apt-daily.timer apparmor
Configure the NTP server and restart the NTP services
timedatectl set-timezone "Asia/Kolkata"
sed -i "s/#NTP=/NTP=time\.google\.com/g" /etc/systemd/timesyncd.conf
sed -i "s/#FallbackNTP=ntp.ubuntu.com/FallbackNTP=ntp\.ubuntu\.com/g" /etc/systemd/timesyncd.conf
systemctl daemon-reload
systemctl stop systemd-timesyncd.service
systemctl start systemd-timesyncd.service
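Check that time synchronisation is active against the configured server:
timedatectl status
timedatectl timesync-status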
Configure the max file size of the journal
sed -i "s/#SystemMaxFileSize.*/SystemMaxFileSize=512M/g" /etc/systemd/journald.conf
Configure the maximum number of open files and the maximum number of processes.
echo "* hard nofile 65536" >> /etc/security/limits.conf
echo "* soft nofile 65536" >> /etc/security/limits.conf
echo "* hard nproc 65536" >> /etc/security/limits.conf
echo "* soft nproc 65536" >> /etc/security/limits.conf
Disable IPv6 and enable huge pages.
sed -i 's/GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="ipv6.disable=1 default_hugepagesz=1G hugepagesz=1G hugepages=64 transparent_hugepage=never "/g' /etc/default/grub
Since I will be adding NVMe disks as OSDs on a bare-metal Ceph installation, disabling the IOMMU is recommended [https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/].
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off iommu=off amd_iommu=off"/g' /etc/default/grub
update-grub
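After the reboot later in this procedure, the kernel command line and huge-page reservation can be verified with:
cat /proc/cmdline
grep -i huge /proc/meminfo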
Pending TODO: add 'isolcpus=0,2,4,6' to reserve those CPUs for the OSD services, then create an other.slice and configure dedicated use of cores 0, 2, 4 and 6 for the OSDs (still to be worked out).
Configure the network interfaces – update /etc/network/interfaces (the sample provided here is from server1 – change the addresses as required)
source /etc/network/interfaces.d/*
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
allow-hotplug eno1
iface eno1 inet static
address 10.0.1.1/16
gateway 10.0.0.1
mtu 9000
allow-hotplug eno2
iface eno2 inet static
address 10.0.2.1/24
mtu 9000
allow-hotplug eno3
iface eno3 inet static
address 10.0.3.1/24
mtu 9000
allow-hotplug eno4
iface eno4 inet static
address 10.0.4.1/24
mtu 9000
allow-hotplug enp3s0
iface enp3s0 inet static
address 10.0.5.1/24
mtu 9000
allow-hotplug enp3s0d1
iface enp3s0d1 inet static
address 10.0.6.1/24
mtu 9000
Reboot the server
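Once the server is back up, jumbo frames can be verified end-to-end with a do-not-fragment ping – a 9000-byte MTU leaves room for an 8972-byte ICMP payload (28 bytes of IP and ICMP headers). The peer address here is an example, testing from server1 to server2 over the Ceph cluster network:
ping -M do -s 8972 -c 3 10.0.5.2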
Enable key-based, passwordless SSH between servers
echo "Host ceph1" > ~/.ssh/config
echo " Hostname ceph1" >> ~/.ssh/config
echo " User root" >> ~/.ssh/config
echo "Host ceph2" >> ~/.ssh/config
echo " Hostname ceph2" >> ~/.ssh/config
echo " User root" >> ~/.ssh/config
echo "Host ceph3" >> ~/.ssh/config
echo " Hostname ceph3" >> ~/.ssh/config
echo " User root" >> ~/.ssh/config
echo "Host ceph4" >> ~/.ssh/config
echo " Hostname ceph4" >> ~/.ssh/config
echo " User root" >> ~/.ssh/config
ssh-keygen -q -N "" -f ~/.ssh/id_rsa
ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph1'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph2'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph3'
ssh-keygen -f '/root/.ssh/known_hosts' -R 'ceph4'
ssh-copy-id ceph1
ssh-copy-id ceph2
ssh-copy-id ceph3
ssh-copy-id ceph4
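A quick loop to confirm key-based logins work to every node (assuming the host aliases configured above):
for h in ceph1 ceph2 ceph3 ceph4; do ssh "$h" hostname; done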
Tuning NIC throughput – add the following at the end of /etc/sysctl.conf
# Increase socket buffer sizes for high-speed TCP
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 87380 67108864
# Disable reverse-path filtering (rp_filter) globally and on our 40G NICs
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.enp3s0.rp_filter = 0
net.ipv4.conf.enp3s0d1.rp_filter= 0
net.core.netdev_max_backlog = 250000
net.core.somaxconn = 16384
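Apply the settings without a reboot and spot-check a couple of values:
sysctl -p
sysctl net.core.rmem_max net.ipv4.conf.enp3s0.rp_filter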
Basic iperf3 check with server2 as the client and server1 as the server (start iperf3 -s on server1 first)
40G Interface test
root@server2:~# iperf3 -c 10.0.5.1
Connecting to host 10.0.5.1, port 5201
[ 5] local 10.0.5.2 port 44328 connected to 10.0.5.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.20 GBytes 36.0 Gbits/sec 0 1.93 MBytes
[ 5] 1.00-2.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.04 MBytes
[ 5] 2.00-3.00 sec 4.20 GBytes 36.1 Gbits/sec 0 2.04 MBytes
[ 5] 3.00-4.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
[ 5] 4.00-5.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
[ 5] 5.00-6.00 sec 4.21 GBytes 36.1 Gbits/sec 0 2.82 MBytes
[ 5] 6.00-7.00 sec 4.21 GBytes 36.2 Gbits/sec 0 2.82 MBytes
[ 5] 7.00-8.00 sec 4.22 GBytes 36.2 Gbits/sec 0 2.82 MBytes
[ 5] 8.00-9.00 sec 4.22 GBytes 36.2 Gbits/sec 0 2.82 MBytes
[ 5] 9.00-10.00 sec 4.22 GBytes 36.2 Gbits/sec 4 2.53 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 42.1 GBytes 36.2 Gbits/sec 4 sender
[ 5] 0.00-10.00 sec 42.1 GBytes 36.2 Gbits/sec receiver
iperf Done.
root@server2:~#
10G Interface test
root@server2:~# iperf3 -c 10.0.2.1
Connecting to host 10.0.2.1, port 5201
[ 5] local 10.0.2.2 port 33756 connected to 10.0.2.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.16 GBytes 9.92 Gbits/sec 83 1.51 MBytes
[ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
[ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
[ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.51 MBytes
[ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes
[ 5] 5.00-6.00 sec 1.15 GBytes 9.91 Gbits/sec 0 1.54 MBytes
[ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.60 MBytes
[ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.66 MBytes
[ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.72 MBytes
[ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.72 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 83 sender
[ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver
iperf Done.
root@server2:~#
Enable THP in madvise mode at boot via a oneshot service (this runtime setting overrides the transparent_hugepage=never kernel parameter set earlier):
cat <<EOF | tee /etc/systemd/system/enable-thp.service
[Unit]
Description=Enable Transparent Huge Pages in madvise mode
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reexec
systemctl daemon-reload
systemctl enable --now enable-thp.service
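Confirm the mode switched to madvise:
cat /sys/kernel/mm/transparent_hugepage/enabled
# expected output: always [madvise] never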
Huge pages verification
root@server1:~# cat /proc/meminfo | grep "HugePages"
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 32000
HugePages_Free: 32000
HugePages_Rsvd: 0
HugePages_Surp: 0
root@server1:~#
Some notes on selecting NVMe drives:
- Check for PLP (power-loss protection).
- Prefer TLC over QLC.
- TBW – the higher, the better.
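Drives with PLP can acknowledge sync writes from their protected cache, so a single-job fsync-per-write fio run is a quick way to separate PLP from non-PLP drives: consumer drives typically collapse to a few hundred IOPS here, while PLP drives stay in the tens of thousands. A sketch using the fio package installed earlier – the target device is an example, and the test overwrites it, so run it only against an empty disk:
fio --name=plp-check --filename=/dev/cephdisk1 --rw=randwrite --bs=4k \
    --iodepth=1 --numjobs=1 --direct=1 --fsync=1 --runtime=30 --time_based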