
Ceph + KVM : 5. Service checks and CLI commands

Posted on September 22, 2024 (updated April 20, 2025) by sandeep

Previous: Orchestrating Ceph RBD-backed VMs on KVM

All Ceph services need to be in the active state, along with the libvirtd service. I have observed that on a server reboot the ceph-mon and ceph-mgr services start and are immediately deactivated; I still need to review why this happens.

Even though we added an “ExecStartPost” hook to the libvirtd service, we sometimes observed that the storage pools were not active because the Ceph services they depend on were down.

To work around these issues until the service dependencies are fixed properly, a cron job that runs every two minutes has been added; it executes the shell script shown below.
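For reference, a crontab entry along these lines schedules it (the path /usr/bin/cephcs.sh is illustrative; adjust it to wherever the script is saved):

# Run the Ceph/libvirt health-check script every two minutes
*/2 * * * * /usr/bin/cephcs.sh >/dev/null 2>&1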

#!/bin/bash
export PATH=/usr/bin:$PATH
logfile=/var/log/cephcs.log

function log() {
  timenow=$(date)
  # Create the log file on first run and keep only the most recent 500 lines
  touch "$logfile"
  line_count=$(wc -l < "$logfile")
  if (( line_count > 500 )); then
    tail -n 500 "$logfile" > "$logfile.tmp"
    mv "$logfile.tmp" "$logfile"
  fi
  echo "$timenow | $1" >> "$logfile"
}

function waitForRBDPool() {
  # Wait (up to ~15 seconds) for the given Ceph pool to appear in 'ceph osd pool ls'
  maxWaitCount=5
  counter=0
  poolName=$1
  poolCount=$(ceph osd pool ls | grep -c "$poolName")
  if [ "$poolCount" -le 0 ]; then
    while true; do
      ((counter++))
      if [ "$counter" -gt "$maxWaitCount" ]; then
        break
      else
        sleep 3
        poolCount=$(ceph osd pool ls | grep -c "$poolName")
        if [ "$poolCount" -ge 1 ]; then
          break
        fi
      fi
    done
  fi
  if [ "$counter" -le "$maxWaitCount" ]; then
    log "CEPH : $poolName is ready."
  else
    log "CEPH : $poolName is not ready."
  fi
}

function checkVirshPool() {
  # Start (and autostart) the libvirt storage pool if it is not listed as active
  poolcount=$(virsh pool-list | grep -c "$1")
  if [ "$poolcount" -lt 1 ]; then
    log "Virsh : $1 : Starting pool"
    virsh pool-start "$1"
    virsh pool-autostart "$1"
    sleep 3
    poolcount=$(virsh pool-list | grep -c "$1")
  fi
  if [ "$poolcount" -eq 1 ]; then
    log "Virsh : $1 : active."
  else
    log "Virsh : $1 : not active."
  fi
}

function checkCephServices() {
  # Build the list of ceph unit names from ceph.target dependencies and start any that are inactive
  systemctl list-dependencies ceph.target | grep -oE 'ceph-[a-z]+@[a-z0-9.]+|ceph-[a-z]+.service' | sort -u > /usr/bin/ceph.services
  for service in $(cat /usr/bin/ceph.services); do
    if ! systemctl is-active "$service" > /dev/null; then
      log "Starting service $service"
      systemctl start "$service"
      sleep 3
    else
      log "$service is active."
    fi
  done
}

function checkLibVirtdServices() {
  if ! systemctl is-active libvirtd.service > /dev/null; then
    log "Libvirt is not active .... checking ceph services..."
    # Only start libvirtd once every ceph service in the list is active
    all_services_active=true
    for service in $(cat /usr/bin/ceph.services); do
      if ! systemctl is-active "$service" > /dev/null; then
        log "Service $service is not active!"
        all_services_active=false
        break
      fi
    done

    if $all_services_active; then
      log "Starting libvirtd..."
      systemctl start libvirtd.service
    else
      log "Not all ceph services are up. Not starting libvirtd..."
    fi
  else
    log "Libvirtd service is active."
  fi
}

function checkActivePools() {
  if systemctl is-active libvirtd.service > /dev/null; then
    log "Libvitd is active... Checking for active pools...."

    secretid=$(awk '/<uuid>/,/<\/uuid>/ { if ($1 ~ /<uuid>/) { gsub(/<\/?uuid>/,""); print $1 } }' /root/kvm/ceph-secret.xml)
    secretvalue=$(awk '/\[client\.libvirt\]/,/^$/ { if ($1 == "key") { print $3 } }' /etc/ceph/ceph.client.libvirt.keyring)
    configuredValue=`virsh secret-get-value $secretid`

    if [ "$secretvalue" != "$configuredValue" ]; then
      log "Setting secret value..."
      virsh secret-set-value --secret "$secretid" --base64 "$secretvalue"
    fi
    waitForRBDPool ssdpool
    waitForRBDPool nvmepool

    checkVirshPool ssdpool
    checkVirshPool nvmepool
  else
    log "Libvirtd is not active, not checking for active pools."
  fi
}

log "-------- Periodic check start ---------------"
checkCephServices
checkLibVirtdServices
checkActivePools
log "-------- Periodic check end -----------------"
exit 0

Notes on some CLI commands

Listing images in a pool, removing an image, and removing the pool

root@server1:~# ceph osd lspools
1 .mgr
2 rbdpool
root@server1:~# rbd ls rbdpool
testblockdevice
root@server1:~# rbd rm rbdpool/testblockdevice
Removing image: 100% complete...done.
root@server1:~# ceph osd pool rm rbdpool rbdpool --yes-i-really-really-mean-it
pool 'rbdpool' removed
root@server1:~#

Removing a monitor node (run from any remaining monitor node)

NODETOREMOVE=storage4
OTHMON1=storage2
OTHMON2=storage3
ceph mon remove $NODETOREMOVE
monmaptool /etc/ceph/monmap --rm $NODETOREMOVE
scp /etc/ceph/monmap $OTHMON1:/etc/ceph/monmap
scp /etc/ceph/monmap $OTHMON2:/etc/ceph/monmap
ssh $OTHMON1 "chown ceph:ceph /etc/ceph/monmap"
ssh $OTHMON1 "systemctl stop ceph-mon.target"
ssh $OTHMON1 "systemctl start ceph-mon.target"
ssh $OTHMON2 "chown ceph:ceph /etc/ceph/monmap"
ssh $OTHMON2 "systemctl stop ceph-mon.target"
ssh $OTHMON2 "systemctl start ceph-mon.target"

Removing the manager node (from any server)

NODETOREMOVE=storage4
ceph mgr fail $NODETOREMOVE
ssh $NODETOREMOVE "systemctl stop ceph-mgr.target"
ssh $NODETOREMOVE "systemctl disable ceph-mgr.target"

Changing replication factor

ceph osd pool set nvmepool size 2
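The value can be read back (and min_size reviewed, since it may also need adjusting when lowering the replication factor) with:

ceph osd pool get nvmepool size
ceph osd pool get nvmepool min_size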

Mounting block device, unmounting and deleting image

rbd create --size 10G --pool ssdpool ssdbd
rbd map ssdbd --pool ssdpool
mkfs.ext4 /dev/rbd0
mkdir /root/test
mount /dev/rbd0 /root/test
umount /root/test
rbd unmap ssdpool/ssdbd
rbd rm ssdpool/ssdbd
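If the /dev/rbdX device name is not known, the current mappings can be listed before unmapping:

rbd showmapped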

Get crush map

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

Set crush map

crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
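If only the CRUSH rules need reviewing, they can also be inspected directly without decompiling the map:

ceph osd crush rule ls
ceph osd crush rule dump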

Set the number of PGs for a pool

ceph osd pool set <poolname> pg_num <newvalue>

It is recommended to understand the following configuration options before changing PG counts:

mon_osd_min_in_ratio
osd_max_pg_per_osd_hard_ratio
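As a quick check, the values currently in effect can be read from the cluster configuration (a sketch; ceph config get is available on recent releases):

ceph config get mon mon_osd_min_in_ratio
ceph config get osd osd_max_pg_per_osd_hard_ratio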

Removing an OSD from the cluster
  • Identify the OSD (ceph osd tree)
  • Mark the OSD as ‘out’ (ceph osd out <osd_id>)
  • Wait for rebalancing to complete (ceph -w)
  • Remove the OSD from the CRUSH map (ceph osd crush remove osd.<osd_id>), may not be required
  • Stop the OSD service (systemctl stop ceph-osd@<osd_id>.service)
  • Purge the OSD (ceph osd purge <osd_id> --force)
  • Remove the OSD data: clean up LVM with lvdisplay/lvremove, vgdisplay/vgremove, pvdisplay/pvremove, then delete the partition with fdisk (a sample sequence is sketched below)
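A minimal sketch of the sequence for a hypothetical osd.5 (the ID and the LVM volume names are illustrative; wait for rebalancing to finish before stopping the daemon):

OSD_ID=5
ceph osd out $OSD_ID
ceph -w                               # watch until the cluster returns to HEALTH_OK
systemctl stop ceph-osd@$OSD_ID.service
ceph osd purge $OSD_ID --force
# Then clean up the backing LVM volumes on the OSD host:
lvdisplay                             # identify the ceph logical volume
vgdisplay                             # identify the ceph volume group
# lvremove/vgremove/pvremove the ceph-* volumes, then delete the partition with fdisk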

Clearing warning messages for which corrective action has already been taken

root@server2:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            9 mgr modules have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum ceph1,ceph3,ceph2 (age 103m)
    mgr: ceph2(active, since 4m), standbys: ceph1, ceph4, ceph3
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

root@server2:~# ceph crash ls
ID ENTITY NEW
2024-10-13T15:41:19.164944Z_d04402ee-d3a5-4173-9743-0a11c554cc85 mgr.ceph2 *
2024-10-13T15:41:47.715119Z_879cff77-9204-4631-a89d-012a12da7e63 mgr.ceph4 *
2024-10-13T15:43:35.282384Z_a3a64dc0-4ecf-40c2-a194-b5679d271630 mgr.ceph1 *
2024-10-13T15:43:47.108238Z_cfbed953-a2a6-4745-a671-b5c9ff113fa7 mgr.ceph3 *
2024-10-13T15:47:03.374137Z_0d312c39-2bc8-4e92-b8c4-2ac129bfb352 mgr.ceph2 *
2024-10-13T17:02:37.284921Z_d1e218d8-cf7a-4a98-b235-bfc3774a59ef mgr.ceph3 *
2024-10-13T17:05:25.774249Z_4ce30ade-db1c-4668-8109-f94416b5b202 mgr.ceph3 *
2024-10-13T17:15:31.760921Z_2f8dc8a2-368d-408c-9a73-e0d7727307f8 mgr.ceph2 *
2024-10-13T17:26:26.946611Z_bd502a8a-5953-4246-b1f9-df7dbcb178ee mgr.ceph2 *
root@server2:~# ceph crash rm 2024-10-13T15:41:19.164944Z_d04402ee-d3a5-4173-9743-0a11c554cc85
root@server2:~# ceph crash rm 2024-10-13T15:41:47.715119Z_879cff77-9204-4631-a89d-012a12da7e63
root@server2:~# ceph crash rm 2024-10-13T15:43:35.282384Z_a3a64dc0-4ecf-40c2-a194-b5679d271630
root@server2:~# ceph crash rm 2024-10-13T15:43:47.108238Z_cfbed953-a2a6-4745-a671-b5c9ff113fa7
root@server2:~# ceph crash rm 2024-10-13T15:47:03.374137Z_0d312c39-2bc8-4e92-b8c4-2ac129bfb352
root@server2:~# ceph crash rm 2024-10-13T17:02:37.284921Z_d1e218d8-cf7a-4a98-b235-bfc3774a59ef
root@server2:~# ceph crash rm 2024-10-13T17:05:25.774249Z_4ce30ade-db1c-4668-8109-f94416b5b202
root@server2:~# ceph crash rm 2024-10-13T17:15:31.760921Z_2f8dc8a2-368d-408c-9a73-e0d7727307f8
root@server2:~# ceph crash rm 2024-10-13T17:26:26.946611Z_bd502a8a-5953-4246-b1f9-df7dbcb178ee
root@server2:~# ceph -s
  cluster:
    id:     577c09c2-c514-471a-aee1-6a0f56c83c3a
    health: HEALTH_WARN
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum ceph1,ceph3,ceph2 (age 106m)
    mgr: ceph2(active, since 7m), standbys: ceph1, ceph4, ceph3
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
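Instead of removing the crash entries one by one, they can also be archived in a single step; archived entries no longer raise the recent-crash health warning but remain visible via ceph crash ls:

ceph crash archive-all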

RBD pool-related commands
rbd ls <poolname> - List all images in the pool
rbd rm <blockdevice> -p <poolname> - Remove an image from the pool

Triggering install of an OS from ISO (UEFI)

virt-install --name=debian12 --ram=8192 --vcpus=8 \
  --disk path=/var/lib/libvirt/images/debian12.qcow2,size=5 \
  --cdrom=/var/lib/libvirt/images/debian12.iso \
  --network bridge=br1 --graphics vnc \
  --os-type=linux --os-variant=debian11 --boot uefi,hd
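Once the installer is running, the VM state and its VNC display can be checked with:

virsh list --all
virsh vncdisplay debian12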
