Management Server for Home Lab

In a home lab that also serves as an R&D environment, frequent rebuilds, upgrades, and full reinstall cycles are inevitable. To avoid losing critical documentation, Docker repositories, source code, and other persistent assets, I decided to dedicate one physical server exclusively for management. This system now hosts all essential services and a lightweight K3s cluster used to test container images before pushing them to the production RKE2 environment.

The management server itself is a 10-bay Dell R630 configured for high reliability. It uses three 2.5″ HDDs in a RAID 5 array for the operating system and seven 1 TB SSDs in a RAID 10 virtual disk dedicated to running VMs. This layout provides both performance and redundancy, ensuring that critical data and essential services remain protected even in the event of disk failures.

From past experience with KVM hosts, I learned that VM virtual disks backed by hardware RAID do not always retain consistent device naming after a server reboot. This inconsistency can break automation and orchestration workflows. To address this, I created udev rules that match each disk’s ID_PART_ENTRY_UUID and map it to a stable, human-friendly device name that corresponds to the VM name. This ensures reliable device paths that can be safely used by the VM orchestration scripts.

This post captures all the steps involved—starting from the base OS installation—required to configure and prepare this server as the dedicated management backbone of the lab.

Note: Two NVIDIA L4 GPUs are installed in this server. Ubuntu 24.04 was a deliberate choice: the drivers, cuda-toolkit, and flash-attn compilation were all validated on it beforehand.

BIOS configuration (Dell R630)

System Profile & CPU Optimization

These settings ensure your 72 logical threads are not throttled by power-saving features.

  • System Profile: Set to Performance.
    • Effect: Disables C-states and C1E, keeping the CPU at maximum frequency and eliminating wake-up latency.
  • Logical Processor: Enabled.
    • Effect: Provides the full 72 threads (36 physical cores + Hyper-threading).
  • CPU Interconnect Bus Speed: Maximum Data Rate.
  • Virtualization Technology: Enabled.

Memory & Interrupt Handling

These settings are critical for preventing “stuttering” issues and for handling high-speed I/O.

  • Memory Operating Mode: Optimizer Mode.
  • Node Interleaving: Disabled (NUMA Enabled).
  • x2APIC Mode: Enabled.
    • Effect: Essential for handling the large number of interrupts generated by 72 threads and high-speed Mellanox adapters.

I/O & Hardware Passthrough (L4 GPU)

Required for the vfio-pci driver to successfully claim the GPU for the Golden Image.

  • Integrated Devices:
    • SR-IOV Global Enable: Enabled.
    • I/OAT DMA Engine: Enabled.
  • Slot Disablement: Ensure the PCIe slots containing your L4 GPUs (Groups 4 and 5) are Enabled.
  • Memory Mapped I/O above 4GB: Enabled.
    • Effect: Allows the system to address the large VRAM on the L4 cards.

Networking (Mellanox ConnectX-3 Pro)

  • Embedded NIC1 Port 1/2: Set to Enabled.
  • PCIe Slot (Mellanox): Ensure this is set to Max Link Speed (Gen3 x8).

1. QPI Link Power Management

  • Recommended Setting: Disabled
  • Why: The QuickPath Interconnect (QPI) is the high-speed “highway” between your two physical CPUs.
    • The Default Behavior: When the link is idle, the BIOS tries to put the QPI into a low-power state (L0s or L1).
    • The Problem: Waking the link back up to full speed takes time. This creates a bottleneck when a process on CPU 0 needs to access memory or a PCIe device (like your L4 GPU) controlled by CPU 1.
    • The Benefit: Disabling this keeps the interconnect “always-on” at maximum bandwidth, ensuring consistent sub-microsecond latency for NUMA-aware workloads like your PostgreSQL instance.

2. Logical Processor Idling

  • Recommended Setting: Disabled
  • Why: This setting allows the OS to put unused logical processors (Hyper-threaded cores) into a deep sleep state to save power.
    • The Problem: In a 72-thread environment, the kernel’s scheduler frequently moves tasks across cores. If the target logical processor is “idling” in a deep sleep, there is a significant delay (latency) to re-activate it.
    • The Benefit: Disabling this ensures that even if only 10% of your CPU is being used, all 72 threads remain in a “halt” state rather than a “sleep” state, allowing for instantaneous execution the moment a new task arrives from RabbitMQ or your Spring Boot application.

Post-OS installation configuration

Remove snapd Completely

Snap introduces background services and timers not required for production VMs.

List installed snaps:

snap list

Remove them:

snap remove lxd
snap remove core20
snap remove snapd

Uninstall snapd:

apt purge snapd
rm -rf /root/snap/
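To keep apt from silently pulling snapd back in as a dependency later, a negative pin can be added (a sketch; the file name is my own choice, the pin syntax is standard apt_preferences):

```
# /etc/apt/preferences.d/no-snapd
Package: snapd
Pin: release *
Pin-Priority: -10
```

A negative priority means the package will never be installed, even if another package recommends it.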

Disable Swap

Swap is not required for our workload profile.

Check swap units:

systemctl list-units | grep swap

Stop, disable, and mask all swap units:

systemctl stop swap.target
systemctl disable swap.target
systemctl mask swap.target
swapoff -a
rm -f /swap.img

Edit /etc/fstab and comment out any swap entries.
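A quick check (a sketch using the standard /proc/swaps and fstab paths) confirms nothing is left swapping after the reboot:

```shell
# /proc/swaps lists one header line plus one line per active swap device
awk 'NR > 1 { n++ } END { print n + 0, "active swap device(s)" }' /proc/swaps

# No uncommented swap entries should remain in fstab
grep -E '^[^#]*\bswap\b' /etc/fstab || echo "fstab clean"
```

Both lines should come back empty: zero active devices, and “fstab clean”.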

Configure NIC and Bridge Interfaces

To support high-performance networking and KVM virtualization, all NICs are configured with an MTU of 9000. The system includes four 10 GbE interfaces (Intel X710: eno1–eno4) and two 40 GbE interfaces (Mellanox ConnectX-3 Pro: enp3s0 and enp3s0d1). Since this server hosts multiple VMs, each physical interface is bridged to provide direct, high-throughput connectivity to the guests.

The first interface (eno1) serves as the management network and uses a /16 address. This is intentional because the default route (via 10.0.0.1, the UDM Pro security gateway) resides on the same network. All remaining interfaces are assigned /24 subnets so that east-west traffic stays localized and is switched internally by the Arista 7050QX.

Below is the complete netplan configuration:

# /etc/netplan/50-cloud-init.yaml
# This file is generated from cloud-init data. To disable cloud-init
# network configuration, create:
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
# with: network: {config: disabled}

network:
  version: 2
  ethernets:
    eno1: { mtu: 9000 }
    eno2: { mtu: 9000 }
    eno3: { mtu: 9000 }
    eno4: { mtu: 9000 }
    enp3s0: { mtu: 9000 }
    enp3s0d1: { mtu: 9000 }

  bridges:
    br1:
      interfaces: [eno1]
      addresses: [10.0.1.5/16]
      routes:
        - to: default
          via: 10.0.0.1
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

    br2:
      interfaces: [eno2]
      addresses: [10.0.2.5/24]
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

    br3:
      interfaces: [eno3]
      addresses: [10.0.3.5/24]
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

    br4:
      interfaces: [eno4]
      addresses: [10.0.4.5/24]
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

    br5:
      interfaces: [enp3s0]
      addresses: [10.0.5.5/24]
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

    br6:
      interfaces: [enp3s0d1]
      addresses: [10.0.6.5/24]
      parameters:
        stp: false
        forward-delay: 0
      mtu: 9000

The resulting routing table reflects the designated default route and the isolated /24 subnets:

default via 10.0.0.1 dev br1 proto static
10.0.0.0/16 dev br1 proto kernel scope link src 10.0.1.5
10.0.2.0/24 dev br2 proto kernel scope link src 10.0.2.5
10.0.3.0/24 dev br3 proto kernel scope link src 10.0.3.5
10.0.4.0/24 dev br4 proto kernel scope link src 10.0.4.5
10.0.5.0/24 dev br5 proto kernel scope link src 10.0.5.5
10.0.6.0/24 dev br6 proto kernel scope link src 10.0.6.5
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown
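After a `sudo netplan apply` (or a reboot), a quick loop over sysfs verifies that jumbo frames are active everywhere; every bridge and member NIC should report 9000 (a sketch — interface names will differ on other hosts):

```shell
# Print the MTU of every network interface known to the kernel
for dev in /sys/class/net/*; do
  printf '%-12s mtu %s\n' "${dev##*/}" "$(cat "$dev/mtu")"
done
```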

Disable Cloud-Init Networking

Since the network configuration is fully managed through custom Netplan files, cloud-init’s networking component must be disabled to prevent it from overwriting settings on reboot. The following steps disable cloud-init networking and clean any previous state:

# Disable cloud-init from managing network configuration
sudo tee /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg > /dev/null <<EOF
network: {config: disabled}
EOF

# Clean cloud-init state and logs
sudo cloud-init clean --logs

Configure NTP

To ensure accurate time synchronization—critical for logs, certificates, KVM operations, and distributed systems—the server is configured to use Google’s public NTP service with Ubuntu’s NTP servers as fallback. The timezone is also set to match the local region.

# Set system timezone
sudo timedatectl set-timezone "Asia/Kolkata"

# Configure primary and fallback NTP servers
sudo sed -i "s/#NTP=/NTP=time.google.com/g" /etc/systemd/timesyncd.conf
sudo sed -i "s/#FallbackNTP=ntp.ubuntu.com/FallbackNTP=ntp.ubuntu.com/g" /etc/systemd/timesyncd.conf

# Reload and restart the time synchronization service
# Restart the time synchronization service
sudo systemctl restart systemd-timesyncd.service

You can verify synchronization using:

timedatectl status

Configure System Limits and Core Settings

Set Maximum Journal File Size

To prevent uncontrolled growth of systemd journal logs, configure a maximum file size of 512 MB:

sudo sed -i "s/#SystemMaxFileSize.*/SystemMaxFileSize=512M/g" /etc/systemd/journald.conf
sudo systemctl restart systemd-journald

Increase File Descriptor and Process Limits

For a server that runs KVM, containers, orchestration scripts, and various management services, increasing the maximum number of open files and processes is essential:

echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nproc 65536"  | sudo tee -a /etc/security/limits.conf
echo "* soft nproc 65536"  | sudo tee -a /etc/security/limits.conf
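These limits are applied by pam_limits at login, so they only affect new sessions. After logging back in, a quick check:

```shell
# Both should report 65536 once limits.conf is in effect for the session
ulimit -n   # max open files
ulimit -u   # max user processes
```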

Disable Nouveau and Prepare for NVIDIA Drivers

This server uses NVIDIA GPUs and is part of a vGPU-enabled environment, so blacklist the nouveau driver to avoid conflicts:

echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/disable-nouveau.conf


Force it into the Boot Image:

The initramfs carries its own copy of the /etc/modprobe.d/ configuration, so the new blacklist must be rebuilt into it to take effect during early boot:

sudo update-initramfs -u
sudo reboot
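After the reboot, confirm nouveau never loaded (a sketch reading /proc/modules directly, so it works even on minimal installs):

```shell
# nouveau must not appear among the loaded kernel modules
grep -w nouveau /proc/modules || echo "nouveau not loaded"
```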

Disable Unwanted and Automated Services

To prevent background package updates, firmware refresh jobs, and automated MOTD tasks from interfering with lab automation or controlled upgrade cycles, disable and mask them:

sudo systemctl stop unattended-upgrades.service \
  apt-daily-upgrade.timer apt-daily.timer \
  fwupd-refresh.timer motd-news.timer \
  update-notifier-download.timer update-notifier-motd.timer

sudo systemctl disable unattended-upgrades.service \
  apt-daily-upgrade.timer apt-daily.timer \
  fwupd-refresh.timer motd-news.timer \
  update-notifier-download.timer update-notifier-motd.timer

sudo systemctl mask unattended-upgrades.service \
  apt-daily-upgrade.timer apt-daily.timer \
  fwupd-refresh.timer motd-news.timer \
  update-notifier-download.timer update-notifier-motd.timer

Configuring udev Rules

Step 1: Extract Partition UUIDs

For each partition in the VM storage RAID, extract the ID_PART_ENTRY_UUID value using udevadm:

udevadm info --query=all --name=/dev/sdb1 | grep "ID_PART_ENTRY_UUID"
udevadm info --query=all --name=/dev/sdb2 | grep "ID_PART_ENTRY_UUID"
udevadm info --query=all --name=/dev/sdb3 | grep "ID_PART_ENTRY_UUID"
udevadm info --query=all --name=/dev/sdb4 | grep "ID_PART_ENTRY_UUID"
udevadm info --query=all --name=/dev/sdb5 | grep "ID_PART_ENTRY_UUID"
udevadm info --query=all --name=/dev/sdb6 | grep "ID_PART_ENTRY_UUID"

Sample output (one per partition):

E: ID_PART_ENTRY_UUID=7e50d064-c0d1-2f48-bc2b-c22bf8a98933
E: ID_PART_ENTRY_UUID=d7276c4a-c944-8249-95fc-5f5845ad8d9e
E: ID_PART_ENTRY_UUID=f8f0bac9-34d0-5a46-b6c0-80a7bed863a1
E: ID_PART_ENTRY_UUID=9aa79a3f-e05a-1c4b-bb81-acd8712e2f8b
E: ID_PART_ENTRY_UUID=b1016c48-2878-3d4d-9933-c3f103450c06
E: ID_PART_ENTRY_UUID=a483e696-afd5-3d43-a8aa-690f125ac70e

Each UUID uniquely identifies a partition and remains stable across reboots.
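The six queries above can be collapsed into a single loop. The `uuid_of` helper below is my own addition (not from the original steps), and the loop assumes the VM storage RAID is still enumerated as /dev/sdb:

```shell
# uuid_of: extract the ID_PART_ENTRY_UUID value from `udevadm info --query=all` output
uuid_of() { sed -n 's/^E: ID_PART_ENTRY_UUID=//p'; }

for part in /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4 /dev/sdb5 /dev/sdb6; do
  command -v udevadm >/dev/null || break   # skip cleanly where udevadm is unavailable
  printf '%s  %s\n' "$part" "$(udevadm info --query=all --name="$part" 2>/dev/null | uuid_of)"
done
```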


Step 2: Create Custom udev Rules

Create the rules file:

/etc/udev/rules.d/99-kvm-storage.rules

Add the following mappings to create stable, human-readable device names:

SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="7e50d064-c0d1-2f48-bc2b-c22bf8a98933", SYMLINK+="mirror"
SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="d7276c4a-c944-8249-95fc-5f5845ad8d9e", SYMLINK+="mdb"
SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="f8f0bac9-34d0-5a46-b6c0-80a7bed863a1", SYMLINK+="git"
SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="9aa79a3f-e05a-1c4b-bb81-acd8712e2f8b", SYMLINK+="dcm"
SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="b1016c48-2878-3d4d-9933-c3f103450c06", SYMLINK+="web"
SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="a483e696-afd5-3d43-a8aa-690f125ac70e", SYMLINK+="registry"

This creates persistent symlinks that can be safely used by VM orchestration scripts:

/dev/mirror  
/dev/mdb  
/dev/git  
/dev/dcm  
/dev/web  
/dev/registry
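Rather than hand-editing the rules file, the mappings can be generated from a simple name/UUID table. `gen_rules` is a hypothetical helper of mine; review the generated file, then install it with `sudo install -m 644 99-kvm-storage.rules /etc/udev/rules.d/`:

```shell
# gen_rules: turn "symlink-name  uuid" pairs on stdin into udev rule lines
gen_rules() {
  while read -r name uuid; do
    [ -n "$name" ] || continue
    printf 'SUBSYSTEM=="block", ENV{ID_PART_ENTRY_UUID}=="%s", SYMLINK+="%s"\n' \
      "$uuid" "$name"
  done
}

# Build the file locally first (UUIDs from Step 1)
gen_rules > 99-kvm-storage.rules <<'EOF'
mirror   7e50d064-c0d1-2f48-bc2b-c22bf8a98933
mdb      d7276c4a-c944-8249-95fc-5f5845ad8d9e
git      f8f0bac9-34d0-5a46-b6c0-80a7bed863a1
dcm      9aa79a3f-e05a-1c4b-bb81-acd8712e2f8b
web      b1016c48-2878-3d4d-9933-c3f103450c06
registry a483e696-afd5-3d43-a8aa-690f125ac70e
EOF
cat 99-kvm-storage.rules
```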

Step 3: Apply the udev Rules

Apply the new rules and trigger them:

sudo udevadm control --reload-rules
sudo udevadm trigger

You can verify the symlinks with:

ls -l /dev | grep -E "mirror|mdb|git|dcm|web|registry"

Remove Unwanted Packages

Since this server is dedicated to lab management and does not require Canonical’s Ubuntu Pro or Ubuntu Advantage subscription tooling, these packages can be safely removed to reduce noise and background processes:

sudo apt purge ubuntu-advantage-tools ubuntu-pro-client*
sudo apt autoremove --purge -y

This keeps the system lean and prevents unwanted prompts or background checks related to subscription features.

GRUB updates

Disable Transparent Huge Pages (THP) and set the CPU governor to performance.

Transparent Huge Pages (THP) can introduce unpredictable latency in database workloads, especially on systems running PostgreSQL, MongoDB, Redis, or other latency-sensitive services. THP tries to automatically merge standard 4 KB pages into 2 MB huge pages, but this background merging/compaction can cause stalls that negatively affect consistent performance. PostgreSQL does not benefit from THP and instead prefers madvise/regular huge pages only when explicitly configured.

For a standalone PostgreSQL server doing Git + AI/ML metadata workloads, disabling THP helps maintain consistent response times, reduce jitter, and avoid CPU stalls caused by THP compaction, especially under heavy write or mixed workloads.

Modern CPUs use “governors” to manage power. The default is usually powersave or schedutil, which dynamically scales the frequency based on load.

  • The Problem: Scaling takes time (microseconds). For high-speed messaging in RabbitMQ or rapid Spring Boot API calls, the time a core needs to ramp from 1.2 GHz to 2.3+ GHz causes measurable micro-stutters.
  • The Solution: The performance governor tells the intel_pstate driver to ignore power savings and stay at the highest P-state.
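Both settings can be verified at runtime via sysfs. `active_thp` is a small helper of mine that extracts the bracketed (currently active) mode from the kernel's one-line file format:

```shell
# active_thp: "always madvise [never]" -> "never"
active_thp() { tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'; }

thp=/sys/kernel/mm/transparent_hugepage/enabled
[ -r "$thp" ] && echo "THP mode: $(active_thp < "$thp")"

# Every core should report "performance" (path is absent on some VMs/containers)
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  [ -r "$g" ] && echo "$g: $(cat "$g")"
done
true  # keep a clean exit status when the cpufreq paths are missing
```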

PCI Pass-through: NVIDIA Tesla L4

1. Why PCI Pass-through?

The primary goal is to use the Tesla L4 at 100% capacity within a Virtual Machine for high-speed document parsing/inference without the complexity of vGPU.

  • Zero Licensing: PCI Pass-through is native and free, bypassing the need for NVIDIA License System (NLS) servers.
  • Lean Host: The Host remains a “carrier,” avoiding proprietary drivers that could break during kernel updates.

2. Host Configuration

We isolate the GPU hardware so the guest VM can claim it exclusively.

Identify the Hardware

# Locate the card's address and Vendor ID
lspci -nnk | grep -i nvidia

  • Address: 81:00.0 (the physical PCI slot on the motherboard).
  • ID: [10de:27b8] (NVIDIA's vendor ID paired with the Tesla L4's device ID).

Edit /etc/default/grub and add the isolation parameters (afterwards, run sudo update-grub and reboot):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt kvm.ignore_msrs=1 vfio-pci.ids=10de:27b8 transparent_hugepage=never cpufreq.default_governor=performance module_blacklist=nouveau,nvidiafb,nvidia"

Kernel & Boot Parameters (GRUB)

  • intel_iommu=on: Enables the Intel hardware IOMMU (VT-d) to provide memory isolation between the host and VMs.
  • iommu=pt: Sets the IOMMU to “Pass-Through” mode; prevents the host from attempting to manage DMA for devices it doesn’t own, maximizing I/O performance.
  • vfio-pci.ids=10de:27b8: Explicitly instructs the kernel to bind the NVIDIA L4 GPUs to the VFIO-PCI stub driver during the early boot phase.
  • transparent_hugepage=never: Disables THP to eliminate memory compaction “pauses,” which prevents micro-stutters in database and AI inference workloads.
  • cpufreq.default_governor=performance: Overrides the default power-saving scaling; locks all 72 threads at their maximum clock frequency for zero-latency execution.
  • module_blacklist=nouveau,nvidiafb,nvidia: Completely prevents these drivers from loading, ensuring they cannot “claim” the L4 hardware before the VM starts.
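Once GRUB has been regenerated with sudo update-grub and the server rebooted, both the live kernel command line and the driver binding can be checked (81:00.0 is this host's slot from the lspci output above):

```shell
# The flags should all appear on the running kernel's command line
tr ' ' '\n' < /proc/cmdline | grep -E 'iommu|vfio|hugepage|governor' || echo "flags not active yet"

# "Kernel driver in use" should now say vfio-pci, not nouveau/nvidia
if command -v lspci >/dev/null; then
  lspci -nnk -s 81:00.0 | grep -i 'kernel driver in use' || true
fi
```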

Setting up a reverse proxy (nginx) to route incoming HTTP/HTTPS requests arriving at the firewall to the respective VMs, based on FQDN. Note: all FQDNs have a local DNS entry on the UDM Pro.

apt install nginx -y

Assumption: Necessary keys/certificates copied at required folders.

chmod 640 /etc/ssl/private/yourdomain.key
chmod 644 /etc/ssl/certs/yourdomain.crt
chmod 644 /etc/ssl/certs/ca_bundle.crt
chown root:www-data /etc/ssl/private/yourdomain.key

Create /etc/nginx/sites-available/yourdomain

# ---------------------------------------------------------
# 1. DEFAULT BLOCKER (Security Catch-All)
# ---------------------------------------------------------
# Drops connections for bare IPs or unknown domains
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;

    ssl_certificate /etc/ssl/certs/yourdomain.crt;
    ssl_certificate_key /etc/ssl/private/yourdomain.key;
    ssl_trusted_certificate /etc/ssl/certs/ca_bundle.crt;

    return 444;
}

# ---------------------------------------------------------
# 2. HTTP -> HTTPS REDIRECT
# ---------------------------------------------------------
server {
    listen 80;
    server_name *.yourdomain.net;
    return 301 https://$host$request_uri;
}

# ---------------------------------------------------------
# 3. DYNAMIC DNS PROXY
# ---------------------------------------------------------
server {
    listen 443 ssl;
    server_name *.yourdomain.net;

    # SSL Config
    ssl_certificate /etc/ssl/certs/yourdomain.crt;
    ssl_certificate_key /etc/ssl/private/yourdomain.key;
    ssl_trusted_certificate /etc/ssl/certs/ca_bundle.crt;

    # Performance & Security
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_session_cache shared:SSL:10m;
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

    # --- DNS RESOLVER CONFIG ---
    # Points to the UDM Pro gateway
    resolver 10.0.0.1 valid=10s;

    location / {
        # DYNAMIC PROXY: forward to the VM whose DNS record matches the FQDN
        proxy_pass http://$host:80;

        # Standard Headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

sudo ln -s /etc/nginx/sites-available/yourdomain /etc/nginx/sites-enabled/
sudo rm /etc/nginx/sites-enabled/default
sudo nginx -t
sudo systemctl restart nginx
sudo systemctl enable nginx
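A quick way to smoke-test the wildcard vhost before DNS is in place is curl's `--resolve` flag, which pins the FQDN to the proxy's br1 address. Here git.yourdomain.net is a hypothetical FQDN, and `resolve_spec` is a helper of mine that builds the `host:port:address` triple:

```shell
# resolve_spec: format the "<host>:<port>:<address>" argument for curl --resolve
resolve_spec() { printf '%s:%s:%s' "$1" "$2" "$3"; }

# Print only the HTTP status; an unknown FQDN should hit the 444 catch-all
command -v curl >/dev/null && curl -sk -o /dev/null -w '%{http_code}\n' \
  --connect-timeout 3 \
  --resolve "$(resolve_spec git.yourdomain.net 443 10.0.1.5)" \
  'https://git.yourdomain.net/' || true
```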