The objective is to stand up a vLLM service serving Qwen3-VL-8B-FP8; it will initially be used for AI-assisted purchase order processing.
Rationale for model selection: the server has a single NVIDIA L4 GPU (24 GB), so an FP8-quantized 8B vision-language model is a practical fit.
Download model
hf auth login
hf download Qwen/Qwen3-VL-8B-Instruct-FP8 --local-dir /opt/models/Qwen3-VL-8B-FP8
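Before wiring up the service, it is worth confirming the snapshot actually landed in the directory the unit file will point at. A quick sanity check (same path as above):

```shell
# Verify the download: the directory should contain config.json,
# tokenizer files, and one or more *.safetensors weight shards
ls -lh /opt/models/Qwen3-VL-8B-FP8

# Total size on disk (an FP8 8B checkpoint is roughly 9-10 GB)
du -sh /opt/models/Qwen3-VL-8B-FP8
```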
Create a systemd unit file to start the service (/etc/systemd/system/vllm.service)
[Unit]
Description=vLLM Qwen 3 8B-VL FP8 Service
After=network.target nvidia-persistenced.service
[Service]
Type=simple
User=root
Environment="CUDA_VISIBLE_DEVICES=0"
# The model path is the positional argument after 'serve'
# --limit-mm-per-prompt expects a JSON mapping of modality to max count per request
ExecStart=/opt/ai-env/bin/vllm serve /opt/models/Qwen3-VL-8B-FP8 \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--limit-mm-per-prompt '{"image":2}' \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--port 8001
Restart=always
RestartSec=10
StandardOutput=append:/var/log/ai.log
StandardError=append:/var/log/ai.log
[Install]
WantedBy=multi-user.target
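With the unit file in place, reload systemd and start the service. The commands below use the unit name, model path, and port from the unit above; the health check and sample request assume the OpenAI-compatible API that vllm serve exposes by default (by default the served model name is the path passed to serve), and the image URL is a placeholder, not a real document.

```shell
# Pick up the new unit, then enable at boot and start now
systemctl daemon-reload
systemctl enable --now vllm.service

# Follow startup logs; loading the model on an L4 can take a few minutes
journalctl -u vllm.service -f

# Health check: the served model should appear in the model list
curl -s http://localhost:8001/v1/models

# Sample multimodal request for the purchase-order use case
# (placeholder image URL; prompt is illustrative)
curl -s http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/opt/models/Qwen3-VL-8B-FP8",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/po-scan.png"}},
        {"type": "text", "text": "Extract the PO number, supplier, and line items as JSON."}
      ]
    }],
    "max_tokens": 512
  }'
```

If the health check fails, check /var/log/ai.log (where the unit appends stdout and stderr) for out-of-memory or model-load errors.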