The objective is to stand up a vLLM service serving Qwen3-VL-8B-FP8; it will initially be used for AI-assisted purchase order processing.
Rationale for model selection: the server has a single NVIDIA L4 GPU (24 GB), so an FP8-quantized 8B vision-language model is a practical fit.
Download model
hf auth login
hf download Qwen/Qwen3-VL-8B-Instruct-FP8 --local-dir /opt/models/Qwen3-VL-8B-FP8
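Before wiring up the service, it is worth confirming the snapshot actually landed in the directory the unit file will point at. A quick sanity check (same path as above):

```shell
# Verify the download: the directory should contain config.json,
# tokenizer files, and one or more *.safetensors weight shards
ls -lh /opt/models/Qwen3-VL-8B-FP8

# Total size on disk (an FP8 8B checkpoint is roughly 9-10 GB)
du -sh /opt/models/Qwen3-VL-8B-FP8
```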
Create a systemd unit file to start the service (/etc/systemd/system/vllm.service)
[Unit]
Description=vLLM Qwen 3 8B-VL FP8 Service
After=network.target nvidia-persistenced.service
[Service]
Type=simple
User=root
Environment="CUDA_VISIBLE_DEVICES=0"
# The model path is the positional argument after 'serve'
# --limit-mm-per-prompt expects a JSON mapping of modality to max count per request
ExecStart=/opt/ai-env/bin/vllm serve /opt/models/Qwen3-VL-8B-FP8 \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--limit-mm-per-prompt '{"image":2}' \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--port 8001
Restart=always
RestartSec=10
StandardOutput=append:/var/log/ai.log
StandardError=append:/var/log/ai.log
[Install]
WantedBy=multi-user.target
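With the unit file in place, reload systemd and start the service. The commands below use the unit name, model path, and port from the unit above; the health check and sample request assume the OpenAI-compatible API that vllm serve exposes by default (by default the served model name is the path passed to serve), and the image URL is a placeholder, not a real document.

```shell
# Pick up the new unit, then enable at boot and start now
systemctl daemon-reload
systemctl enable --now vllm.service

# Follow startup logs; loading the model on an L4 can take a few minutes
journalctl -u vllm.service -f

# Health check: the served model should appear in the model list
curl -s http://localhost:8001/v1/models

# Sample multimodal request for the purchase-order use case
# (placeholder image URL; prompt is illustrative)
curl -s http://localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/opt/models/Qwen3-VL-8B-FP8",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/po-scan.png"}},
        {"type": "text", "text": "Extract the PO number, supplier, and line items as JSON."}
      ]
    }],
    "max_tokens": 512
  }'
```

If the health check fails, check /var/log/ai.log (where the unit appends stdout and stderr) for out-of-memory or model-load errors.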