Before deploying Ceph, it is essential to prepare the servers correctly at the hardware, BIOS, operating system, and networking layers. Many Ceph performance and stability issues originate from gaps at this stage. This post documents the preparation steps and design decisions used for my Ceph deployment.
1. BIOS Configuration
All servers were first configured at the BIOS level. This step is critical and must be completed before installing or tuning the operating system.
1.1 Memory and NUMA Configuration
- Node Interleaving: Disabled
- Location: BIOS → Memory Settings
Disabling Node Interleaving allows the operating system to see distinct NUMA nodes and ensures PCIe devices access local memory efficiently.
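After the reboot, a quick sanity check from the OS confirms that the setting took effect; with interleaving disabled, these dual-socket servers should report two NUMA nodes:
lscpu | grep -i numa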
1.2 Processor Configuration
Configured under:
BIOS → Processor Settings
- Logical Processor (Hyper-Threading): Enabled
- Virtualization Technology (VT-x): Enabled
- VT-d / IOMMU: Enabled
A cold reboot was performed after all BIOS changes.
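Once the system is back up, the processor settings can be verified from the OS; as a rough check (exact dmesg wording varies by platform):
grep -Ec '(vmx|svm)' /proc/cpuinfo   # non-zero when VT-x / AMD-V is exposed
dmesg | grep -iE 'DMAR|IOMMU'        # confirms VT-d / IOMMU initialization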
2. Boot and Operating System Disk Layout
Each server contains eight front-drive bays. Disk layout was intentionally split between the operating system and Ceph storage devices.
- 4 × HDDs were configured in RAID 10
- Purpose: Boot and Operating System
- Managed by the hardware RAID controller
- Provides redundancy and predictable OS stability
- 4 × SSDs in the remaining front bays were configured as Non-RAID (JBOD)
- Purpose: Ceph OSD devices
- Directly exposed to the operating system
- Allows Ceph to manage redundancy, replication, and recovery
This separation ensures that Ceph data disks are not abstracted behind a RAID layer while keeping the OS protected against disk failures.
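A quick way to confirm the resulting layout on each host (device names and sizes will differ per controller) is:
lsblk -o NAME,SIZE,TYPE,ROTA,MODEL
The RAID 10 virtual disk appears as a single boot device, while the JBOD SSDs show up as individual drives (ROTA=0).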
3. Cluster Size and Host Planning
The Ceph cluster is intentionally planned with four physical servers, even though only three are required for quorum and data replication.
The fourth server is reserved to handle:
- Host-down or failure scenarios
- Planned maintenance and rolling hardware upgrades
This allows one host to be taken offline at any time without impacting cluster availability.
4. Dedicated Network Interfaces for Ceph
Each host is equipped with:
- 4 × 10 Gbps NIC ports
- 2 × 40 Gbps NIC ports
Ceph uses two separate networks:
- Public (Client / NBI) Network
- Cluster (Internal / Replication) Network
To avoid contention:
- One 10G NIC is dedicated to the public network
- One 40G NIC is dedicated to the cluster network
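Interface state and negotiated speed can be checked per port; the interface name below is a placeholder and will differ per host:
ip -br link
cat /sys/class/net/<interface>/speed   # reported in Mb/s, e.g. 10000 or 40000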
5. PCIe Placement and NUMA Awareness
PCIe and NUMA topology were explicitly validated:
- Mellanox ConnectX-3 Pro 40G NIC → CPU 0
- Onboard 10G NICs and SATA controller → CPU 0
- NVMe devices via PCIe adapters → CPU 1
Verified using:
lstopo-no-graphics
(hwloc package)
While asymmetric, this layout performs well when NUMA is configured correctly.
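Beyond lstopo, the locality of an individual device can be read directly from sysfs (interface name is a placeholder; 0 or 1 identifies the owning socket, while -1 means the platform did not report locality):
cat /sys/class/net/<interface>/device/numa_node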
Installing and Preparing Ubuntu 22.04 for Ceph
After completing BIOS configuration and disk layout, the next step is to install and prepare the operating system consistently across all servers.
6. Install Ubuntu 22.04 LTS
- Install Ubuntu Server 22.04 LTS
- Do not select the HWE kernel option
- Use the default GA kernel for maximum stability and driver predictability
This ensures consistent kernel behavior across all Ceph nodes.
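A quick check after installation: the GA kernel on 22.04 reports a 5.15.x version, whereas HWE kernels report a newer series.
uname -r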
7. Remove snapd Completely
Snap packages are not required for Ceph or VM workloads and introduce unnecessary background services.
Verify installed snaps:
snap list
Remove snap packages:
snap remove lxd
snap remove core20
Remove snapd entirely:
apt purge --autoremove snapd
rm -rf /root/snap/
This reduces system noise and background activity.
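Optionally, a negative apt pin keeps snapd from being pulled back in as a dependency later; this is a common approach, and the file name below is arbitrary:
cat <<'EOF' > /etc/apt/preferences.d/nosnap.pref
Package: snapd
Pin: release a=*
Pin-Priority: -10
EOF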
8. Disable Swap
Ceph and virtualization workloads expect predictable memory behavior. Swap must be disabled.
Check swap units:
systemctl list-units | grep swap
Disable swap:
systemctl stop swap.target
systemctl disable swap.target
systemctl mask swap.target
swapoff -a
Edit /etc/fstab and comment out all swap entries.
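If preferred, the fstab change can be made non-interactively (review the file afterwards) and the result verified:
sed -i '/^[^#].*\sswap\s/ s/^/#/' /etc/fstab
swapon --show   # no output expected
free -h         # Swap line should read 0B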
9. Configure File Descriptor and Process Limits
Increase system limits to support Ceph daemons and VM workloads.
Append the following to /etc/security/limits.conf:
* hard nofile 65536
* soft nofile 65536
* hard nproc 65536
* soft nproc 65536
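The new limits apply to fresh login sessions (pam_limits is enabled by default on Ubuntu). Verify after logging in again:
ulimit -n   # expected: 65536
ulimit -u   # expected: 65536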
10. Limit systemd Journal Size
Prevent unbounded disk usage by system logs.
Edit /etc/systemd/journald.conf:
SystemMaxFileSize=512M
Or apply directly:
sed -i "s/#SystemMaxFileSize.*/SystemMaxFileSize=512M/g" /etc/systemd/journald.conf
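Restart journald to apply the change and confirm current usage:
systemctl restart systemd-journald
journalctl --disk-usage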
11. Configure DNS Using systemd-resolved
Ensure consistent and predictable name resolution.
Edit /etc/systemd/resolved.conf:
[Resolve]
DNS=<Your DNS Server IP>
FallbackDNS=8.8.8.8
Domains=<your domain>
DNSStubListener=no
Update /etc/resolv.conf:
ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf
systemctl restart systemd-resolved
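Verify that the expected DNS server and search domain are active (the query target below is any host resolvable in your domain):
resolvectl status
resolvectl query <hostname of another node>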
12. Install Standard Required Packages
Install baseline tools used for validation, troubleshooting, and Ceph preparation:
apt -y install net-tools rsyslog bc fio iperf3 gnupg2 software-properties-common lvm2 nfs-common jq hwloc
13. Clean Up SSH Login Messages
Reduce noise during SSH login.
Edit:
nano /etc/pam.d/sshd
Comment out:
#session optional pam_motd.so motd=/run/motd.dynamic
#session optional pam_motd.so noupdate
#session optional pam_mail.so standard noenv
Disable MOTD news:
echo "ENABLED=0" >> /etc/default/motd-news
14. Disable Unwanted Timer Services
Disable background timers that cause unpredictable load.
List timers:
systemctl list-units | grep timer
Disable and mask:
systemctl stop apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
motd-news.timer update-notifier-download.timer update-notifier-motd.timer
systemctl disable apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
motd-news.timer update-notifier-download.timer update-notifier-motd.timer
systemctl mask apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
motd-news.timer update-notifier-download.timer update-notifier-motd.timer
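Verify that none of these timers remain scheduled:
systemctl list-timers --all | grep -E 'apt-daily|fwupd|motd-news|update-notifier'
Masked timers should show no upcoming activation.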
15. Disable Unattended Upgrades
All upgrades are managed manually and performed in a controlled manner.
systemctl stop unattended-upgrades.service
systemctl disable unattended-upgrades.service
systemctl mask unattended-upgrades.service
16. Disable Ubuntu Advantage Services
Ubuntu Advantage is not required for this environment.
systemctl stop ubuntu-advantage.service
systemctl disable ubuntu-advantage.service
systemctl mask ubuntu-advantage.service
17. Disable AppArmor and UFW
These security layers are unnecessary in a trusted, firewalled lab environment and can interfere with storage and virtualization workloads.
systemctl stop apparmor ufw
systemctl disable apparmor ufw
systemctl mask apparmor ufw
18. Disk Identification Using udev Rules
Why udev Rules Are Required
Linux block device names such as /dev/sda, /dev/sdb, or /dev/nvme0n1 are not stable.
They can change across:
- Reboots
- Firmware updates
- Controller resets
- Disk replacements
- PCIe enumeration order changes
In a Ceph cluster, incorrect disk identification can be catastrophic—leading to accidental OSD recreation or data loss. For this reason, persistent and deterministic disk naming is mandatory.
udev rules allow disks to be referenced using hardware-unique identifiers, ensuring that Ceph always interacts with the intended devices.
Storage Mix and Pool Design
This setup intentionally uses a mix of enterprise and consumer-grade storage:
- Enterprise SSDs
- Minimum of 3 enterprise SSDs per host
- Used exclusively for latency-sensitive and reliability-critical workloads
- Consumer SSDs and NVMe drives
- All NVMe devices are consumer-grade
- Used for application workloads where performance is important but endurance requirements are lower
Based on this, two logical Ceph pools are planned:
- db_pool
- Backed only by enterprise-grade SSDs
- Intended for database and critical workloads
- app_pool
- Backed by remaining SSDs and all NVMe devices
- Intended for application and general-purpose workloads
To enforce this separation reliably, disk identification must be explicit and repeatable across all hosts.
Identifying Disks Using Persistent Attributes
For each SSD and NVMe device, retrieve a stable identifier using:
udevadm info --query=property --name=/dev/nvme2n1 | grep ID_SERIAL_SHORT
The command must be executed for every SSD and NVMe device on every host.
The ID_SERIAL_SHORT attribute is:
- Unique per device
- Persistent across reboots
- Independent of kernel enumeration order
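To collect the identifier for every candidate device in one pass, a small loop like the following can be used (the globs assume SATA/SAS disks appear as /dev/sdX and NVMe namespaces as /dev/nvmeXn1; adjust to match the host):
for dev in /dev/sd? /dev/nvme?n1; do
  echo -n "$dev: "
  udevadm info --query=property --name="$dev" | grep ID_SERIAL_SHORT
done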
Creating udev Rules
Create the rules file:
/etc/udev/rules.d/99-ceph-disks.rules
Sample rules from one of the servers:
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5002538e00495343", SYMLINK+="db1"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5002538e1040492d", SYMLINK+="db2"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="55cd2e4150de0e21", SYMLINK+="db3"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="5001b448c4acd56b", SYMLINK+="app1"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="7QH01KJX", SYMLINK+="app2"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614391M", SYMLINK+="app3"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="S4EWNX0W614439F", SYMLINK+="app4"
This creates stable device paths such as:
/dev/db1 /dev/db2 /dev/db3
/dev/app1 /dev/app2 /dev/app3 /dev/app4
These symbolic links are independent of /dev/sdX or /dev/nvmeXnY naming.
Applying the udev Rules
Reload and apply the rules:
udevadm control --reload-rules
udevadm trigger
Verify:
ls -l /dev/db*
ls -l /dev/app*
readlink -f /dev/db* /dev/app*
Ensure that each symbolic link points to the correct physical device.
19. Passwordless SSH Configuration Between All Hosts
Ceph administration using cephadm relies heavily on SSH-based orchestration.
To ensure smooth cluster bootstrap, host onboarding, and future maintenance, passwordless SSH access must be configured between all nodes.
This includes:
- Each host to itself
- Each host to every other host in the cluster
Why Passwordless SSH Is Required
- cephadm executes commands remotely via SSH
- Manual password prompts break automation
- Consistent SSH trust avoids failures during:
- bootstrap
- host addition
- daemon deployment
- upgrades and maintenance
Establishing this trust early prevents hard-to-debug issues later.
Generate SSH Key (If Not Already Present)
On each host, verify or generate an SSH key:
ls -l ~/.ssh/id_rsa.pub
If not present:
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
Configure Passwordless SSH to All Hosts (Including Self)
From each host, run:
ssh-copy-id server1
ssh-copy-id server2
ssh-copy-id server3
ssh-copy-id server4
Repeat this process on every server, including copying the key to itself.
This ensures symmetric, bidirectional SSH access.
Verify SSH Connectivity
From each host, verify:
ssh -o BatchMode=yes server1 'hostname -f'
ssh -o BatchMode=yes server2 'hostname -f'
ssh -o BatchMode=yes server3 'hostname -f'
ssh -o BatchMode=yes server4 'hostname -f'
Each command should return immediately without prompting for a password.
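The same check can be wrapped in a loop so any host that still prompts or fails stands out immediately:
for h in server1 server2 server3 server4; do
  ssh -o BatchMode=yes "$h" 'hostname -f' || echo "SSH to $h FAILED"
done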
20. Benefits of the Design Choices
The preparation steps and design decisions outlined in this post were made deliberately to optimize performance, reliability, and operational simplicity for a Ceph-based storage platform.
Key Benefits
1. Predictable and Stable Performance
- BIOS-level NUMA configuration (Node Interleaving disabled) ensures local memory access for PCIe devices.
- Dedicated 40G networking for cluster traffic delivers near line-rate replication performance.
- Removing background services (Snap, Unattended Upgrades, Timers) eliminates unpredictable load and latency spikes.
2. Clear Separation of Responsibilities
- RAID 10 is used exclusively for the operating system, ensuring OS stability and easy recovery.
- Ceph manages redundancy and fault tolerance for data disks directly, without interference from hardware RAID.
- Enterprise SSDs and consumer-grade devices are intentionally separated into different Ceph pools.
3. Reduced Risk During Failures and Maintenance
- Deterministic disk naming via udev rules eliminates ambiguity during reboots, disk replacements, or controller changes.
- A fourth host is intentionally reserved to handle node failures and rolling maintenance without compromising availability.
- Conservative capacity planning ensures sufficient headroom during host outages and recovery operations.
4. Operational Simplicity and Repeatability
- All servers are prepared identically, reducing configuration drift.
- Explicit validation of NUMA topology, PCIe placement, and networking removes assumptions.
- The environment is clean, quiet, and predictable before Ceph is introduced.
5. Future-Proofing Without Premature Optimization
- NUMA and PCIe locality are validated but not over-tuned.
- Advanced optimizations (CPU pinning, IRQ tuning, RDMA) are deferred until real-world metrics justify them.
- The design leaves room for incremental improvements without requiring disruptive re-architecture.