Preparing Servers for Ceph

Before deploying Ceph, it is essential to prepare the servers correctly at the hardware, BIOS, operating system, and networking layers. Many Ceph performance and stability issues originate from gaps at this stage. This post documents the preparation steps and design decisions used for my Ceph deployment.


1. BIOS Configuration

All servers were first configured at the BIOS level. This step is critical and must be completed before installing or tuning the operating system.

1.1 Memory and NUMA Configuration

  • Node Interleaving: Disabled
    • Location: BIOS → Memory Settings

Disabling Node Interleaving allows the operating system to see distinct NUMA nodes and ensures PCIe devices access local memory efficiently.

1.2 Processor Configuration

Configured under:

BIOS → Processor Settings
  • Logical Processor (Hyper-Threading): Enabled
  • Virtualization Technology (VT-x): Enabled
  • VT-d / IOMMU: Enabled

A cold reboot was performed after all BIOS changes.
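
After the reboot, the BIOS settings can be sanity-checked from the operating system (exact output varies by platform):

lscpu | grep -E 'NUMA node|Virtualization'    # expect one NUMA node per socket and VT-x listed
dmesg | grep -i -e DMAR -e IOMMU              # confirms VT-d / IOMMU is active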


2. Boot and Operating System Disk Layout

Each server contains eight front-drive bays. Disk layout was intentionally split between the operating system and Ceph storage devices.

  • 4 × HDDs were configured in RAID 10
    • Purpose: Boot and Operating System
    • Managed by the hardware RAID controller
    • Provides redundancy and predictable OS stability
  • 4 × SSDs in the remaining front bays were configured as Non-RAID (JBOD)
    • Purpose: Ceph OSD devices
    • Directly exposed to the operating system
    • Allows Ceph to manage redundancy, replication, and recovery

This separation ensures that Ceph data disks are not abstracted behind a RAID layer while keeping the OS protected against disk failures.
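
Once the operating system is installed, the resulting layout can be confirmed with lsblk (models and sizes will differ per server):

lsblk -o NAME,SIZE,MODEL,ROTA,TYPE,MOUNTPOINT

The RAID 10 virtual disk should appear as a single device carrying the OS partitions, while the four SSDs should show up as individual, empty block devices.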


3. Cluster Size and Host Planning

The Ceph cluster is intentionally planned with four physical servers, even though three hosts are the minimum needed for monitor quorum and three-way replication.

The fourth server is reserved to handle:

  • Host-down or failure scenarios
  • Planned maintenance and rolling hardware upgrades

This allows one host to be taken offline at any time without impacting cluster availability.


4. Dedicated Network Interfaces for Ceph

Each host is equipped with:

  • 4 × 10 Gbps NIC ports
  • 2 × 40 Gbps NIC ports

Ceph uses two separate networks:

  • Public (Client / NBI) Network
  • Cluster (Internal / Replication) Network

To avoid contention:

  • One 10G NIC is dedicated to the public network
  • One 40G NIC is dedicated to the cluster network
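
As an illustration, a minimal netplan sketch for this split is shown below; the file name, the interface names (eno1, ens2f0), and the RFC 5737 documentation subnets are placeholders that must be replaced with the real values for each host:

# /etc/netplan/01-ceph-networks.yaml (example file name)
network:
  version: 2
  ethernets:
    eno1:                          # 10G port, Ceph public network (placeholder name)
      addresses: [192.0.2.11/24]
    ens2f0:                        # 40G port, Ceph cluster network (placeholder name)
      addresses: [198.51.100.11/24]

Apply with netplan apply and verify the addresses with ip -br addr show.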

5. PCIe Placement and NUMA Awareness

PCIe and NUMA topology were explicitly validated:

  • Mellanox ConnectX-3 Pro 40G NIC → CPU 0
  • Onboard 10G NICs and SATA controller → CPU 0
  • NVMe devices via PCIe adapters → CPU 1

Verified using:

lstopo-no-graphics

(provided by the hwloc package)

While asymmetric, this layout performs well when NUMA is configured correctly.
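
Beyond lstopo, the NUMA node of individual devices can be read directly from sysfs; for example (nvme0 is an example controller name, and a value of -1 means the platform reports no locality):

for n in /sys/class/net/*/device/numa_node; do echo "$n: $(cat "$n")"; done
cat /sys/class/nvme/nvme0/device/numa_node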

Installing and Preparing Ubuntu 22.04 for Ceph

After completing BIOS configuration and disk layout, the next step is to install and prepare the operating system consistently across all servers.


6. Install Ubuntu 22.04 LTS

  • Install Ubuntu Server 22.04 LTS
  • Do not select the HWE kernel option
  • Use the default GA kernel for maximum stability and driver predictability

This ensures consistent kernel behavior across all Ceph nodes.
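
A quick check that the GA kernel is in use (Ubuntu 22.04 GA ships the 5.15 series, while HWE kernels report newer versions):

uname -r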


7. Remove snapd Completely

Snap packages are not required for Ceph or VM workloads and introduce unnecessary background services.

Verify installed snaps:

snap list

Remove snap packages:

snap remove lxd
snap remove core20

Remove snapd entirely:

apt purge --autoremove snapd
rm -rf /root/snap/

This reduces system noise and background activity.

8. Disable Swap

Ceph and virtualization workloads expect predictable memory behavior. Swap must be disabled.

Check swap units:

systemctl list-units | grep swap

Disable swap:

systemctl stop swap.target
systemctl disable swap.target
systemctl mask swap.target
swapoff -a

Edit /etc/fstab and comment out all swap entries.
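
One possible way to comment out the swap entries non-interactively, followed by a verification that no swap remains active (the sed pattern is only a sketch; review /etc/fstab afterwards):

sed -i.bak '/\sswap\s/ s/^\([^#]\)/#\1/' /etc/fstab

swapon --show    # should print nothing
free -h          # the Swap line should read 0B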


9. Configure File Descriptor and Process Limits

Increase system limits to support Ceph daemons and VM workloads.

Append the following to /etc/security/limits.conf:

* hard nofile 65536
* soft nofile 65536
* hard nproc  65536
* soft nproc  65536
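
The limits are applied by pam_limits at the next login; after reconnecting, verify with:

ulimit -n    # expected: 65536
ulimit -u    # expected: 65536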

10. Limit systemd Journal Size

Prevent unbounded disk usage by system logs.

Edit /etc/systemd/journald.conf:

SystemMaxFileSize=512M

Or apply directly:

sed -i "s/#SystemMaxFileSize.*/SystemMaxFileSize=512M/g" /etc/systemd/journald.conf
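
Restart journald for the change to take effect and check current log usage:

systemctl restart systemd-journald
journalctl --disk-usage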

11. Configure DNS Using systemd-resolved

Ensure consistent and predictable name resolution.

Edit /etc/systemd/resolved.conf:

[Resolve]
DNS=<Your DNS Server IP>
FallbackDNS=8.8.8.8
Domains=<your domain>
DNSStubListener=no

Update /etc/resolv.conf:

ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf
systemctl restart systemd-resolved
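
Name resolution can then be verified (replace server1 with any resolvable hostname in your domain):

resolvectl status
resolvectl query server1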

12. Install Standard Required Packages

Install baseline tools used for validation, troubleshooting, and Ceph preparation:

apt -y install net-tools rsyslog bc fio iperf3 gnupg2 software-properties-common lvm2 nfs-common jq hwloc

13. Clean Up SSH Login Messages

Reduce noise during SSH login.

Edit:

nano /etc/pam.d/sshd

Comment out:

#session optional pam_motd.so motd=/run/motd.dynamic
#session optional pam_motd.so noupdate
#session optional pam_mail.so standard noenv

Disable MOTD news:

echo "ENABLED=0" >> /etc/default/motd-news

14. Disable Unwanted Timer Services

Disable background timers that cause unpredictable load.

List timers:

systemctl list-units | grep timer

Disable and mask:

systemctl stop apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
               motd-news.timer update-notifier-download.timer update-notifier-motd.timer

systemctl disable apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
                  motd-news.timer update-notifier-download.timer update-notifier-motd.timer

systemctl mask apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
               motd-news.timer update-notifier-download.timer update-notifier-motd.timer
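
A quick confirmation that the units are masked:

systemctl is-enabled apt-daily.timer apt-daily-upgrade.timer fwupd-refresh.timer motd-news.timer

Each unit should report "masked".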

15. Disable Unattended Upgrades

All upgrades are managed manually and performed in a controlled manner.

systemctl stop unattended-upgrades.service
systemctl disable unattended-upgrades.service
systemctl mask unattended-upgrades.service

16. Disable Ubuntu Advantage Services

Ubuntu Advantage is not required for this environment.

systemctl stop ubuntu-advantage.service
systemctl disable ubuntu-advantage.service
systemctl mask ubuntu-advantage.service

17. Disable AppArmor and UFW

These security layers are unnecessary in a trusted, firewalled lab environment and can interfere with storage and virtualization workloads.

systemctl stop apparmor ufw
systemctl disable apparmor ufw
systemctl mask apparmor ufw

18. Disk Identification Using udev Rules

Why udev Rules Are Required

Linux block device names such as /dev/sda, /dev/sdb, or /dev/nvme0n1 are not stable.
They can change across:

  • Reboots
  • Firmware updates
  • Controller resets
  • Disk replacements
  • PCIe enumeration order changes

In a Ceph cluster, incorrect disk identification can be catastrophic—leading to accidental OSD recreation or data loss. For this reason, persistent and deterministic disk naming is mandatory.

udev rules allow disks to be referenced using hardware-unique identifiers, ensuring that Ceph always interacts with the intended devices.


Storage Mix and Pool Design

This setup intentionally uses a mix of enterprise and consumer-grade storage:

  • Enterprise SSDs
    • Minimum of 3 enterprise SSDs per host
    • Used exclusively for latency-sensitive and reliability-critical workloads
  • Consumer SSDs and NVMe drives
    • All NVMe devices are consumer-grade
    • Used for application workloads where performance is important but endurance requirements are lower

Based on this, two logical Ceph pools are planned:

  • db_pool
    • Backed only by enterprise-grade SSDs
    • Intended for database and critical workloads
  • app_pool
    • Backed by remaining SSDs and all NVMe devices
    • Intended for application and general-purpose workloads

To enforce this separation reliably, disk identification must be explicit and repeatable across all hosts.


Identifying Disks Using Persistent Attributes

For each SSD and NVMe device, retrieve a stable identifier using:

udevadm info --query=property --name=/dev/nvme2n1 | grep ID_SERIAL_SHORT

The command must be executed for every SSD and NVMe device on every host.

The ID_SERIAL_SHORT attribute is:

  • Unique per device
  • Persistent across reboots
  • Independent of kernel enumeration order
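
To collect all identifiers in one pass, a small loop can be used (the device globs are examples and should be adjusted to match the actual SATA and NVMe device names on each host):

for d in /dev/sd? /dev/nvme?n1; do
  echo -n "$d: "
  udevadm info --query=property --name="$d" | grep ID_SERIAL_SHORT
done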

Creating udev Rules

Create the rules file:

/etc/udev/rules.d/99-ceph-disks.rules

Sample rules from one of the servers:

ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5002538e00495343", SYMLINK+="db1"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5002538e1040492d", SYMLINK+="db2"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="55cd2e4150de0e21", SYMLINK+="db3"

ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5001b448c4acd56b", SYMLINK+="app1"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="7QH01KJX",            SYMLINK+="app2"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614391M",     SYMLINK+="app3"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614439F",     SYMLINK+="app4"

This creates stable device paths such as:

/dev/db1  /dev/db2  /dev/db3
/dev/app1 /dev/app2 /dev/app3 /dev/app4

These symbolic links are independent of /dev/sdX or /dev/nvmeXnY naming.


Applying the udev Rules

Reload and apply the rules:

udevadm control --reload-rules
udevadm trigger

Verify:

ls -l /dev/db*
ls -l /dev/app*
readlink -f /dev/db* /dev/app*

Ensure that each symbolic link points to the correct physical device.

19. Passwordless SSH Configuration Between All Hosts

Ceph administration using cephadm relies heavily on SSH-based orchestration.
To ensure smooth cluster bootstrap, host onboarding, and future maintenance, passwordless SSH access must be configured between all nodes.

This includes:

  • Each host to itself
  • Each host to every other host in the cluster

Why Passwordless SSH Is Required

  • cephadm executes commands remotely via SSH
  • Manual password prompts break automation
  • Consistent SSH trust avoids failures during:
    • bootstrap
    • host addition
    • daemon deployment
    • upgrades and maintenance

Establishing this trust early prevents hard-to-debug issues later.


Generate SSH Key (If Not Already Present)

On each host, verify or generate an SSH key:

ls -l ~/.ssh/id_rsa.pub

If not present:

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

Configure Passwordless SSH to All Hosts (Including Self)

From each host, run:

ssh-copy-id server1
ssh-copy-id server2
ssh-copy-id server3
ssh-copy-id server4

Repeat this process on every server, including copying the key to itself.

This ensures symmetric, bidirectional SSH access.


Verify SSH Connectivity

From each host, verify:

ssh -o BatchMode=yes server1 'hostname -f'
ssh -o BatchMode=yes server2 'hostname -f'
ssh -o BatchMode=yes server3 'hostname -f'
ssh -o BatchMode=yes server4 'hostname -f'

Each command should return immediately without prompting for a password.

20. Benefits of the Design Choices

The preparation steps and design decisions outlined in this post were made deliberately to optimize performance, reliability, and operational simplicity for a Ceph-based storage platform.

Key Benefits

1. Predictable and Stable Performance

  • BIOS-level NUMA configuration (Node Interleaving disabled) ensures local memory access for PCIe devices.
  • Dedicated 40G networking for cluster traffic delivers near line-rate replication performance.
  • Removing background services (Snap, Unattended Upgrades, Timers) eliminates unpredictable load and latency spikes.

2. Clear Separation of Responsibilities

  • RAID 10 is used exclusively for the operating system, ensuring OS stability and easy recovery.
  • Ceph manages redundancy and fault tolerance for data disks directly, without interference from hardware RAID.
  • Enterprise SSDs and consumer-grade devices are intentionally separated into different Ceph pools.

3. Reduced Risk During Failures and Maintenance

  • Deterministic disk naming via udev rules eliminates ambiguity during reboots, disk replacements, or controller changes.
  • A fourth host is intentionally reserved to handle node failures and rolling maintenance without compromising availability.
  • Conservative capacity planning ensures sufficient headroom during host outages and recovery operations.

4. Operational Simplicity and Repeatability

  • All servers are prepared identically, reducing configuration drift.
  • Explicit validation of NUMA topology, PCIe placement, and networking removes assumptions.
  • The environment is clean, quiet, and predictable before Ceph is introduced.

5. Future-Proofing Without Premature Optimization

  • NUMA and PCIe locality are validated but not over-tuned.
  • Advanced optimizations (CPU pinning, IRQ tuning, RDMA) are deferred until real-world metrics justify them.
  • The design leaves room for incremental improvements without requiring disruptive re-architecture.