AI-Assisted Purchase Order Processing – Implementing the Inference Service

In the previous architectural deep dives, we defined what we are building: a Purchase Order processing system powered by Vision-Language Models (VLMs). Now, we shift gears to how we build it.

This post serves as the engineering log for setting up the core AI Inference Service. We will configure the environment, deploy the Qwen2-VL-7B-Instruct model using vLLM, and wrap it in a lightweight FastAPI middleware to handle business logic.

1. System Preparation

Before we touch the Python stack, we need to ensure our Linux environment is ready to handle PDF rendering and GPU operations.

# Update package lists

sudo apt-get update

# Install system-level dependencies:
# - poppler-utils: Required by pdf2image to render PDFs into images.
# - libgl1: Required by OpenCV/Pillow for image manipulation.

sudo apt-get install -y poppler-utils libgl1

2. Python Environment Setup

We install a targeted set of libraries. We are avoiding the bloat of heavyweight frameworks like LangChain, preferring direct control over model interactions.

# Core AI Serving Engine

pip install vllm

# API & Utilities
# - fastapi/uvicorn: For building the REST API
# - python-multipart: Needed by FastAPI to parse form-data/file uploads
# - pdf2image/pillow: For converting PDF pages to visual inputs
# - openai: Used as the standard client to talk to vLLM
# - pydantic: For strict schema validation of the extracted JSON

pip install fastapi "uvicorn[standard]" python-multipart pdf2image pillow openai pydantic
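
Before moving on, it is worth confirming that pdf2image can actually find the poppler binaries installed in step 1. A minimal sanity check, assuming a PDF is available at a path of your choosing (the path below is a placeholder):

# check_render.py – verify pdf2image + poppler are wired up correctly
# NOTE: /po/sample.pdf is a placeholder path; point it at any PDF you have.
from pdf2image import convert_from_path

pages = convert_from_path("/po/sample.pdf", dpi=200)  # returns one PIL.Image per page
print(f"Rendered {len(pages)} page(s); first page size: {pages[0].size}")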

3. Model Acquisition (NFS Strategy)

Since our architecture uses a persistent NFS mount (/po), we download the model weights there. This ensures that if we redeploy the container or VM, we don’t need to re-download 15 GB of weights.

Why Qwen2-VL? As discussed in the architecture post, this model natively understands visual layouts, eliminating the need for a separate OCR step.

# Create the persistent model directory
mkdir -p /po/models

# Download the model explicitly
# We use symlinks=False to ensure the actual physical files reside on the NFS share
huggingface-cli download Qwen/Qwen2-VL-7B-Instruct \
    --local-dir /po/models/Qwen2-VL-7B-Instruct \
    --local-dir-use-symlinks False

4. Service 1: The AI Engine (vLLM)

We treat the Large Language Model as a foundational infrastructure service, managed by systemd.

Key Configuration Decisions:

  1. --max-model-len 8192: Purchase orders are visually dense. A limit of 8,192 tokens balances the need to process full pages of line items without truncation against the hardware’s VRAM limits.
  2. --gpu-memory-utilization 0.9: We allocate 90% of the GPU memory to the model weights and KV cache.
    • Why not 95%? On a 24GB card (NVIDIA L4), leaving a 10% buffer (~2.4GB) is critical. It prevents Out-Of-Memory (OOM) crashes during spikes in activation memory (which happen when processing complex, full-context prompts) and leaves room for system overhead.
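
As a rough back-of-the-envelope check of that budget (the weight footprint below reuses the ~15 GB figure from the download step; the exact split depends on dtype and vLLM version, so treat this purely as an estimate):

# Rough VRAM budget for a 24 GB NVIDIA L4 – illustrative estimate only
total_vram_gb = 24.0
utilization = 0.9        # --gpu-memory-utilization
weights_gb = 15.0        # approximate Qwen2-VL-7B-Instruct weight footprint (see step 3)

budget_gb = total_vram_gb * utilization   # memory vLLM is allowed to manage: 21.6 GB
kv_cache_gb = budget_gb - weights_gb      # roughly what remains for KV cache/activations: ~6.6 GB
buffer_gb = total_vram_gb - budget_gb     # headroom left for spikes and overhead: 2.4 GB

print(f"budget={budget_gb:.1f} GB, kv_cache≈{kv_cache_gb:.1f} GB, buffer={buffer_gb:.1f} GB")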

Create File: /etc/systemd/system/vllm.service

[Unit]
Description=vLLM Inference Service (Qwen2-VL)
# Critical: Wait for IP assignment before starting, as vLLM binds to a network port.
After=network-online.target
Wants=network-online.target

[Service]
User=root
Group=root
WorkingDirectory=/po
Environment="PYTHONUNBUFFERED=1"

# The Startup Command
# --served-model-name: Sets the identifier we will use in the API client
# --trust-remote-code: Required for Qwen's specific architecture
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model /po/models/Qwen2-VL-7B-Instruct \
    --served-model-name Qwen/Qwen2-VL-7B-Instruct \
    --port 8001 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --trust-remote-code

Restart=always
RestartSec=10
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target

Deploy the Service:

sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
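
The first start takes a few minutes while the weights load into VRAM. Once the port answers, a quick smoke test with the OpenAI client (installed in step 2) confirms the server is serving the model under the name we set with --served-model-name. A minimal sketch, assuming the service is reachable on localhost:8001:

# smoke_test_vllm.py – verify the vLLM OpenAI-compatible endpoint is up
from openai import OpenAI

# vLLM does not require an API key by default, but the client needs some value
client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

# Should list exactly the name we passed via --served-model-name
print([m.id for m in client.models.list().data])

# A trivial text-only completion to prove the generation path works end to end
resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)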

5. Service 2: The Middleware (FastAPI)

While vLLM provides the raw intelligence, it doesn’t know about PDFs or our specific JSON schema. We build a lightweight Middleware Service to bridge this gap.

This service performs three critical tasks:

  1. Rendering: Converts the PDF into high-res images.
  2. Prompt Engineering: Wraps the image in a strict System Prompt.
  3. Guardrails: Uses Pydantic to enforce data types (e.g., ensuring “Price” is a float, not a string).

mkdir -p /po/api

Create File: /po/api/main.py [ DM me on LinkedIn for the source – it is just a single py file ]
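
For orientation, here is a minimal, simplified sketch of the shape such a middleware can take. It is not the author’s actual file: the schema fields (mirroring vendor_name and grand_total from the final test), the system prompt, and the error handling are illustrative placeholders.

# main.py – illustrative sketch only; field names, prompt, and error handling are simplified.
import base64, io, json

import uvicorn
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pdf2image import convert_from_path
from pydantic import BaseModel

app = FastAPI()
llm = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")  # local vLLM

# 3) Guardrails – Pydantic enforces types (e.g. grand_total must parse as a float)
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float

class PurchaseOrder(BaseModel):
    vendor_name: str
    po_number: str
    grand_total: float
    line_items: list[LineItem]

class ExtractRequest(BaseModel):
    filename: str

SYSTEM_PROMPT = (
    "You are a purchase-order extraction engine. "
    "Return ONLY valid JSON matching this schema: "
    + json.dumps(PurchaseOrder.model_json_schema())
)

def page_to_data_url(page) -> str:
    # Encode a rendered PIL page as a base64 data URL for the chat API
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

@app.post("/extract")
def extract(req: ExtractRequest):
    # 1) Rendering – convert the PDF into high-resolution images
    pages = convert_from_path(f"/po/{req.filename}", dpi=200)

    # 2) Prompt engineering – wrap the page image(s) in a strict system prompt
    content = [{"type": "image_url", "image_url": {"url": page_to_data_url(p)}} for p in pages]
    content.append({"type": "text", "text": "Extract the purchase order as JSON."})

    resp = llm.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        temperature=0.0,
        max_tokens=2048,
    )

    # 3) Guardrails – reject anything that does not validate against the schema
    try:
        return PurchaseOrder.model_validate_json(resp.choices[0].message.content).model_dump()
    except Exception as exc:
        raise HTTPException(status_code=422, detail=f"Model output failed validation: {exc}")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)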

Deploy the Middleware:

We configure systemd to order the two services: the API unit is started only after the vLLM unit has been launched. (Note that After= only orders startup; it does not wait for vLLM to finish loading the model, so the first API calls may fail until the engine is ready.)

Create File: /etc/systemd/system/po-api.service

[Unit]
Description=PO AI Middleware Service (FastAPI)
After=network-online.target vllm.service
Wants=network-online.target

[Service]
User=root
Group=root
WorkingDirectory=/po/api
Environment="PYTHONUNBUFFERED=1"
ExecStart=/usr/bin/python3 /po/api/main.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable the Chain:

sudo systemctl daemon-reload
sudo systemctl restart vllm  # Ensures network dependency is applied
sudo systemctl restart po-api

6. The Final Test

With the stack running, we perform an end-to-end validation. We place a sample PDF in our monitored folder and trigger the extraction manually via curl.

# 1. Place a test file (Ensure you have a PDF named 'test_invoice.pdf' in /po)
# You can drag-and-drop a file if using an SFTP client, or copy one:

cp /path/to/your/local/invoice.pdf /po/test_invoice.pdf

# 2. Trigger the extraction

curl -X POST "http://localhost:8000/extract" \
     -H "Content-Type: application/json" \
     -d '{"filename": "test_invoice.pdf"}'

Success Criteria: If the system is healthy, you will receive a JSON response containing structured data (vendor_name, grand_total, etc.) regardless of how messy the original PDF layout was. This confirms that our “Vision-First” extraction strategy is working.