Custom Cloud Image for AI Workloads

This document describes the process for creating a reusable, deterministic Ubuntu 22.04 QCOW2 base image optimized for GPU-accelerated AI workloads. The resulting image is intended to be cloned and customized by automation/orchestration tooling.


1. Base OS Installation

  • Operating System: Ubuntu Server 22.04 LTS
  • Kernel: GA kernel (5.15) — avoid HWE
  • Disk Size: 50 GB (future-proof; base image only)

Why avoid the HWE kernel in the base image

  • GA kernel (5.15) provides maximum stability and predictable behavior.
  • Reduces DKMS rebuild failures during NVIDIA driver installation.
  • Avoids kernel churn across clones.
  • HWE (6.x) can be selectively enabled later if required, but should not be baked into the golden image.

2. System Update

apt update -y
apt upgrade -y

3. SSH Configuration (Remote Root Access)

Update SSH daemon configuration

Edit /etc/ssh/sshd_config and ensure the following are enabled:

PermitRootLogin yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys .ssh/authorized_keys2
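
For automated builds, the manual edit above could be scripted. The following is a minimal sketch, not part of the original procedure: the helper name set_sshd_option is hypothetical, and it assumes GNU sed and a config file in which each option appears on its own line.

```shell
# set_sshd_option FILE KEY VALUE  (hypothetical helper)
# Rewrites an existing "KEY ..." line, including a commented-out default,
# or appends the option when the file does not mention it at all.
# Caveat: a prose comment that starts with the key name is rewritten too.
set_sshd_option() {
  local file="$1" key="$2" value="$3"
  if grep -Eq "^[[:space:]]*#?[[:space:]]*${key}[[:space:]]" "$file"; then
    sed -i -E "s|^[[:space:]]*#?[[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}

# Example (run as root on the image being built):
#   set_sshd_option /etc/ssh/sshd_config PermitRootLogin yes
#   set_sshd_option /etc/ssh/sshd_config PubkeyAuthentication yes
#   sshd -t    # syntax-check the result before restarting the daemon
```

Running the helper twice leaves the file unchanged, which keeps the build step repeatable.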

Update SSH client configuration

Edit /etc/ssh/ssh_config:

StrictHostKeyChecking no

Set root password

passwd

Generate SSH key (optional, for automation)

ssh-keygen

4. Clean Login Noise (MOTD)

Disable MOTD messages

Edit /etc/pam.d/sshd and comment out:

#session optional pam_motd.so motd=/run/motd.dynamic
#session optional pam_motd.so noupdate
#session optional pam_mail.so standard noenv

Disable MOTD news

Edit /etc/default/motd-news:

ENABLED=0

5. Remove Snap and snapd

snap list
snap remove lxd
snap remove core20
snap remove snapd
apt purge --remove snapd
rm -rf /root/snap/

6. Disable Swap (AI-friendly, deterministic memory behavior)

systemctl list-units | grep swap
systemctl stop swap.img.swap swap.target
systemctl disable swap.img.swap swap.target
systemctl mask swap.img.swap swap.target
swapoff -a
rm -f /swap.img

Remove swap entries from /etc/fstab.
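
The fstab cleanup can also be scripted. The sketch below comments the swap entries out instead of deleting them, which is a deliberate variation on the step above so the change stays reversible. The function name disable_swap_entries is hypothetical, and GNU sed is assumed.

```shell
# disable_swap_entries [FSTAB]  (hypothetical helper)
# Comments out every active line whose third field (filesystem type) is "swap".
# Lines that are already comments are left untouched, so reruns are harmless.
disable_swap_entries() {
  local fstab="${1:-/etc/fstab}"
  sed -i -E '/^[[:space:]]*#/!s/^([[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/# \1/' "$fstab"
}

# Example (run as root):
#   disable_swap_entries /etc/fstab
#   swapon --show    # prints nothing when no swap is active
```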


7. Deterministic Network Interface Naming

Edit /etc/default/grub:

GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"

Apply and reboot:

update-grub
reboot
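
If /etc/default/grub already carries other kernel arguments, overwriting GRUB_CMDLINE_LINUX would drop them. The following sketch appends only the missing arguments. The helper name add_grub_args is hypothetical; it assumes GNU sed, a double-quoted GRUB_CMDLINE_LINUX value, and arguments that contain no "|" or "&" characters.

```shell
# add_grub_args FILE ARG...  (hypothetical helper)
# Adds each ARG to GRUB_CMDLINE_LINUX unless it is already present.
add_grub_args() {
  local file="$1"; shift
  local current arg
  # Read the current value; GRUB_CMDLINE_LINUX_DEFAULT is not matched.
  current=$(sed -n -E 's/^GRUB_CMDLINE_LINUX="(.*)"$/\1/p' "$file" | tail -n 1)
  for arg in "$@"; do
    case " $current " in
      *" $arg "*) ;;                                # already present
      *) current="${current:+$current }$arg" ;;     # append with a separator
    esac
  done
  if grep -q '^GRUB_CMDLINE_LINUX=' "$file"; then
    sed -i -E "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${current}\"|" "$file"
  else
    printf 'GRUB_CMDLINE_LINUX="%s"\n' "$current" >> "$file"
  fi
}

# Example (run as root, then apply with update-grub):
#   add_grub_args /etc/default/grub net.ifnames=0 biosdevname=0
```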

8. Cloud-Init Networking Cleanup

rm -f /etc/cloud/cloud.cfg.d/90-installer-network.cfg

Disable cloud-init networking entirely:

echo "network: {config: disabled}" > /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

9. Limit Systemd Journal Size

sed -i 's/#SystemMaxFileSize=.*/SystemMaxFileSize=512M/' /etc/systemd/journald.conf
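
The sed command above only takes effect when the commented-out default line is still present in journald.conf. As an alternative sketch, the same setting could live in a drop-in file, which journald reads from /etc/systemd/journald.conf.d/. The function name write_journald_limit and the file name 90-size-limit.conf are hypothetical choices.

```shell
# write_journald_limit [DIR] [SIZE]  (hypothetical helper)
# Writes a drop-in that sets SystemMaxFileSize without editing journald.conf.
write_journald_limit() {
  local dir="${1:-/etc/systemd/journald.conf.d}" size="${2:-512M}"
  mkdir -p "$dir"
  printf '[Journal]\nSystemMaxFileSize=%s\n' "$size" > "$dir/90-size-limit.conf"
}

# Example (run as root):
#   write_journald_limit
#   systemctl restart systemd-journald
```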

10. systemd-resolved Configuration

Edit /etc/systemd/resolved.conf:

DNS=<Your DNS Server IP>
FallbackDNS=8.8.8.8
Domains=<Your domain>
DNSStubListener=no

Apply:

ln -fs /run/systemd/resolve/resolv.conf /etc/resolv.conf
systemctl restart systemd-resolved

11. Install Base Utilities

apt install -y \
  net-tools rsyslog bc fio iperf3 gnupg2 \
  software-properties-common lvm2 nfs-common jq

12. Disable Unwanted Timers

systemctl list-units | grep timer

Disable:

systemctl stop apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
               motd-news.timer update-notifier-download.timer update-notifier-motd.timer

systemctl disable apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
                  motd-news.timer update-notifier-download.timer update-notifier-motd.timer

systemctl mask apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
             motd-news.timer update-notifier-download.timer update-notifier-motd.timer
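
The three stop/disable/mask passes above can be folded into one loop. This is a sketch under assumptions: the helper name mask_units and the SYSTEMCTL override variable are hypothetical, and the override exists only so the commands can be previewed before running them as root.

```shell
# mask_units UNIT...  (hypothetical helper)
# Stops and disables each unit, then masks it so nothing can start it again.
# Set SYSTEMCTL="echo systemctl" to print the commands instead of running them.
mask_units() {
  local unit
  for unit in "$@"; do
    # "disable --now" also stops the unit; errors are ignored because
    # a unit may not exist on every image.
    ${SYSTEMCTL:-systemctl} disable --now "$unit" 2>/dev/null || true
    ${SYSTEMCTL:-systemctl} mask "$unit"
  done
}

# Dry run:
#   SYSTEMCTL="echo systemctl" mask_units apt-daily.timer apt-daily-upgrade.timer
# Real run (as root):
#   mask_units apt-daily-upgrade.timer apt-daily.timer fwupd-refresh.timer \
#              motd-news.timer update-notifier-download.timer update-notifier-motd.timer
```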

13. Disable Unwanted Services

systemctl stop unattended-upgrades.service apparmor ufw ubuntu-advantage-tools
systemctl disable unattended-upgrades.service apparmor ufw ubuntu-advantage-tools
systemctl mask unattended-upgrades.service apparmor ufw ubuntu-advantage-tools

14. Kernel Headers and Toolchain

apt install -y linux-headers-$(uname -r) build-essential dkms gcc make

15. GPU Drivers

Driver selection

  • Use the Ubuntu-recommended NVIDIA driver (via nvidia-detector)
  • Driver version 580 chosen for:
    • NVIDIA L4 compatibility
    • Stability on kernel 5.15
    • Forward CUDA compatibility

Install driver and utilities

apt install -y nvidia-driver-580-server nvidia-utils-580-server

CUDA runtime is intentionally not installed system-wide.


16. Python Runtime

apt install -y python3 python3-pip

17. Add Environment Variables

Create directories used by Hugging Face

mkdir -p /var/lib/huggingface/{hub,transformers}

chmod -R 755 /var/lib/huggingface

Add the following to /etc/environment

HF_HOME=/var/lib/huggingface
TRANSFORMERS_CACHE=/var/lib/huggingface/transformers
HF_HUB_CACHE=/var/lib/huggingface/hub

TOKENIZERS_PARALLELISM=false
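
Appending to /etc/environment by hand can leave duplicate assignments when the step is repeated. A minimal sketch of an idempotent alternative follows. The helper name set_env_var is hypothetical, and it assumes GNU sed and simple KEY=VALUE lines without quoting.

```shell
# set_env_var FILE KEY VALUE  (hypothetical helper)
# Removes any earlier assignment of KEY, then appends the new one,
# so each variable appears exactly once however often this runs.
set_env_var() {
  local file="$1" key="$2" value="$3"
  touch "$file"
  sed -i "/^${key}=/d" "$file"
  printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Example (run as root):
#   set_env_var /etc/environment HF_HOME /var/lib/huggingface
#   set_env_var /etc/environment HF_HUB_CACHE /var/lib/huggingface/hub
#   set_env_var /etc/environment TOKENIZERS_PARALLELISM false
```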

18. Core Deep Learning Framework

PyTorch (CUDA 12.4 runtime via wheels)

pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu124

Rationale:

  • PyTorch wheels ship the CUDA runtime internally.
  • Avoids system-level CUDA toolkit dependency.
  • Decouples framework upgrades from the OS image.

19. Scientific Python Stack

pip install numpy scipy pandas

20. Hugging Face Ecosystem (Minimal Core)

pip install transformers tokenizers safetensors

21. GPU-Aware Utilities

pip install accelerate

22. Disk Resize on First Boot

Create /usr/local/bin/resizedisk:

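#!/bin/bash
# Grow partition 2 of /dev/vda to fill the enlarged virtual disk.
growpart /dev/vda 2
# Tell the kernel about the new partition size.
partx --update /dev/vda2
# Expand the filesystem to fill the grown partition.
resize2fs /dev/vda2
# Stop and disable guestfs-firstboot.service so it does not run on later boots.
systemctl stop guestfs-firstboot.service
systemctl disable guestfs-firstboot.service
# Remove leftover log files from root's home directory.
rm -f /root/*.log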

Set permissions:

chmod +x /usr/local/bin/resizedisk
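
The script above hard-codes /dev/vda and partition 2. If clones may boot with a different disk name, the disk and partition number could be derived from the root device instead. The following is only a sketch: the helper name split_partition is hypothetical, and it assumes the root filesystem sits on a plain partition (not LVM or device-mapper).

```shell
# split_partition PARTITION  (hypothetical helper)
# Prints "DISK NUMBER" for a partition path, e.g. "/dev/vda 2".
# Returns non-zero when the path does not look like a partition.
split_partition() {
  local part="$1" disk num
  case "$part" in
    /dev/nvme*p[0-9]*|/dev/mmcblk*p[0-9]*)
      num="${part##*p}"           # digits after the final "p"
      disk="${part%p"$num"}"      # strip the "pN" suffix
      ;;
    /dev/nvme*|/dev/mmcblk*)
      return 1                    # whole disk, not a partition
      ;;
    /dev/*[0-9])
      num="${part##*[!0-9]}"      # trailing digits
      disk="${part%"$num"}"
      ;;
    *) return 1 ;;
  esac
  case "$num" in ''|*[!0-9]*) return 1 ;; esac
  printf '%s %s\n' "$disk" "$num"
}

# Example use inside the resize script:
#   set -- $(split_partition "$(findmnt -n -o SOURCE /)")
#   growpart "$1" "$2"
```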

23. Filesystem Optimization and Compaction

e4defrag /
fstrim -av
dd if=/dev/zero of=/zero.fill bs=1M status=progress
rm -f /zero.fill
fstrim -av

24. Final Cleanup and Shutdown

history -c
shutdown -h now

25. Export Base QCOW2 Image

Compress and finalize the base image:

virt-sparsify --compress \
  /var/lib/libvirt/images/ubuntu22.qcow2 \
  /root/kvm-local/ubuntu22/base.qcow2