In the previous post, we outlined the solution approach and high-level architecture. That discussion focused on the “what” and the “why.” Now, we turn our attention to the “how.”
This post details the technology stack, service boundaries, and specific AI model choices that power our AI-assisted Purchase Order Processing system. We have selected a stack built around modern Vision-Language Models to create a system that is flexible, scalable, and able to reason over documents programmatically.
Design Imperatives
Our technology choices are driven by four strategic imperatives designed to unlock the full potential of AI automation:
- Deterministic Reliability: The system must behave predictably. AI may deal in probabilities, but the business process must deal in certainties.
- Strategic Separation of Concerns: We decouple the “Business Logic” (rules, workflows, state) from the “AI Execution” (inference, vectors, tensors) to allow each to scale independently.
- Active Learning Capability: The stack must support a seamless feedback loop where human corrections are automatically converted into training data.
- Operational Agility: We prioritize tools that enable rapid iteration and deployment without upfront overhead from complex orchestration platforms.
Reference Runtime Environment
To ensure reproducibility and adequate performance for the AI workload, this PoC was validated on the following infrastructure:
- Operating System: Ubuntu 22.04 LTS
- GPU Hardware: NVIDIA L4 (24 GB VRAM) – chosen for its balance of inference performance and cost.
- Drivers: NVIDIA Driver 580.95.05+ (Supports CUDA up to 13.0; PoC tested with CUDA 12.x).
- Runtime: Python 3.10+ for AI services; JDK 17 for Spring Boot.
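For a quick preflight check of this environment, a short script like the one below (assuming PyTorch is installed) confirms that the GPU and CUDA runtime are visible to the AI services:

```python
# Preflight check for the reference runtime. Assumes PyTorch is installed;
# values shown in comments reflect the PoC environment described above.
import sys
import torch

assert sys.version_info >= (3, 10), "AI services expect Python 3.10+"
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"

props = torch.cuda.get_device_properties(0)
print(f"GPU:  {props.name}")                            # e.g. NVIDIA L4
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")  # ~24 GiB on an L4
print(f"CUDA: {torch.version.cuda}")                    # e.g. 12.x
```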
The Hybrid “Best-of-Breed” Architecture
To achieve our goals, we employ a hybrid architecture that leverages the specific strengths of two dominant ecosystems: Java for enterprise stability and Python for AI innovation.
1. The Control Plane (Java + Spring Boot)
The orchestration, validation, and human interaction layers are built on Spring Boot.
- Why: Java provides the type safety, concurrency management, and rich ecosystem needed for long-running business processes. It acts as the “Traffic Controller,” ensuring that documents flow correctly, state is preserved, and human reviews are logged in an auditable manner.
2. The AI Execution Plane (Python + vLLM)
A Python-based serving layer handles the heavy lifting of inference.
- Why: Python is the native language of modern AI. By exposing our models via standard APIs, we create a seamless bridge between the raw power of the GPU and the structured world of the enterprise application.
Core Services Overview
The system is composed of five specialized microservices:
- Order Watcher (Spring Boot): The “eyes” of the system. It monitors input channels (e.g., FTP or email), waits for incoming files to finish transferring (i.e., become stable), and publishes the ingestion event (an illustrative payload follows this list).
- Order Processor (Spring Boot): The workflow engine. It consumes events, orchestrates calls to the AI service, applies business validation rules, and decides whether to route to ERP or Human Review.
- Review & Retrain Controller (Spring Boot): The user interface backend. It serves the UI for human operators to review anomalies and, crucially, captures their corrections to trigger the retraining loop.
- AI Inference Service (vLLM): The intelligence hub. It accepts raw document images and returns structured, normalized JSON.
- Model Retraining Service (Python): The learning center. It consumes corrected data to fine-tune the Vision-Language model, improving its instruction-following capabilities for future runs.
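To make the event-driven flow concrete, the ingestion event published by the Order Watcher might look roughly like the following. The field names here are illustrative, not the actual contract:

```python
# Hypothetical shape of the ingestion event that the Order Watcher publishes
# and the Order Processor consumes. Field names are illustrative only.
ingestion_event = {
    "eventType": "PO_RECEIVED",
    "documentId": "po-2024-000123",                    # assigned by the watcher
    "source": "ftp",                                   # or "email"
    "storagePath": "/data/inbox/po-2024-000123.pdf",   # shared document store
    "receivedAt": "2024-05-01T09:30:00Z",
}
```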
The AI Engine: Qwen2-VL and vLLM
For the critical task of extracting data from complex purchase orders, we have selected Qwen2-VL-7B-Instruct, served via vLLM.
The Model: Qwen2-VL-7B-Instruct
We chose this specific Vision-Language Model (VLM) because it is capable of “reading” document images natively.
- Visual Reasoning: Unlike text-only models, Qwen2-VL processes the PDF’s visual layout. It understands that a number in a grid cell is a line-item price, and that a date in the top-right corner of the page is likely the PO date.
- Structured Output: The model demonstrates exceptional adherence to instructions. When prompted to extract data into a specific JSON schema, it reliably produces valid, parseable JSON without requiring complex post-processing scripts (a prompt sketch follows this list).
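As an illustration of such a prompt, here is a minimal sketch. The target schema below is hypothetical; the real schema is defined by our business validation rules:

```python
# Illustrative extraction prompt. The field names in TARGET_SCHEMA are
# hypothetical placeholders, not the schema used by the actual system.
import json

TARGET_SCHEMA = {
    "poNumber": "string",
    "poDate": "YYYY-MM-DD",
    "vendorName": "string",
    "currency": "ISO 4217 code",
    "lineItems": [
        {"sku": "string", "description": "string",
         "quantity": "number", "unitPrice": "number"}
    ],
}

EXTRACTION_PROMPT = (
    "You are a purchase-order extraction engine. Read the attached document "
    "image and return ONLY a JSON object matching this schema, using null "
    f"for any field you cannot find:\n{json.dumps(TARGET_SCHEMA, indent=2)}"
)
```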
The Serving Layer: vLLM
To run this model efficiently in production, we utilize the vLLM library.
- Why vLLM? It provides high-throughput serving with state-of-the-art memory management (PagedAttention).
- Standard Interface: We launch the model using vLLM’s OpenAI-compatible API server. This allows our Spring Boot application to communicate with the AI model via standard HTTP clients, treating the advanced AI engine as just another RESTful web service (a client sketch follows this list).
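Because the server speaks the OpenAI chat-completions dialect, any off-the-shelf client works. A minimal sketch in Python follows; in production the caller is the Spring Boot service, and the endpoint, prompt, and file name here are illustrative:

```python
# Minimal client for vLLM's OpenAI-compatible server, assumed to have been
# started with, e.g.: vllm serve Qwen/Qwen2-VL-7B-Instruct
# URL, prompt, and image path are illustrative.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("po_page_1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract the purchase order as JSON per the schema."},
        ],
    }],
    temperature=0.0,  # deterministic decoding for parseable output
)
print(response.choices[0].message.content)  # the structured JSON string
```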
Infrastructure & Data Backbone
To support this high-performance stack, we rely on battle-tested infrastructure components:
- Messaging (RabbitMQ): Decouples our services, ensuring that a spike in document volume doesn’t cause the system to crash. With durable queues and consumer acknowledgements, no PO is lost, and messages within a queue are delivered in order (a consumer sketch follows this list).
- Persistence (MariaDB): Serves as the single source of truth. We use native JSON support to store the complex hierarchical data of purchase orders while maintaining the relational integrity needed for audit logs.
- Compute (NVIDIA L4): The system is powered by NVIDIA L4 GPUs. In 16-bit precision, the 7B model’s weights occupy roughly 15–16 GB, fitting in the L4’s 24 GB of VRAM with headroom for the KV cache and delivering high-throughput inference at a fraction of the cost of larger data-center GPUs.
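To illustrate the reliability settings mentioned above, a minimal consumer might look like the sketch below. The queue name and handler are hypothetical, and our production consumers use Spring AMQP rather than Python:

```python
# Sketch of a durable, acknowledgement-based RabbitMQ consumer (via pika).
# Queue name and handler are hypothetical; production consumers use Spring AMQP.
import pika

def process_purchase_order(body: bytes) -> None:
    """Placeholder for the real workflow call in the Order Processor."""
    print("processing", body[:80])

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="po.ingestion", durable=True)  # survives broker restarts
channel.basic_qos(prefetch_count=1)  # one in-flight message per consumer

def on_message(ch, method, properties, body):
    process_purchase_order(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

channel.basic_consume(queue="po.ingestion", on_message_callback=on_message)
channel.start_consuming()
```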
The “Learning Loop” Advantage
The true power of this stack lies in how these components interact to create a self-improving system.
When a human operator corrects a document in the Spring Boot application, that data isn’t just saved; it is transformed. The Retraining Service picks up the corrected JSON and the original document image to create a new “visual instruction” pair, along the lines of the sketch below.
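Conceptually, the transformation looks something like this. The chat-style sample format is an assumption; the real format depends on the fine-tuning toolkit used by the Retraining Service:

```python
# Sketch: turn a human correction into a visual-instruction training sample.
# The chat-style format and file paths are assumptions for illustration.
import json

def build_training_sample(image_path: str, corrected_json: dict,
                          prompt: str) -> dict:
    """Pair the original document image with the human-corrected output."""
    return {
        "images": [image_path],
        "messages": [
            {"role": "user", "content": f"<image>\n{prompt}"},
            {"role": "assistant", "content": json.dumps(corrected_json)},
        ],
    }

sample = build_training_sample(
    image_path="corrections/po-2024-000123.png",
    corrected_json={"poNumber": "PO-000123", "currency": "USD"},
    prompt="Extract the purchase order as JSON per the schema.",
)
print(json.dumps(sample, indent=2))
```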
This means the system doesn’t just “process” orders; it learns from them. Every exception handled by a human today becomes a training example that teaches the VLM how to handle that specific vendor or edge case tomorrow. This architecture turns operational friction into a competitive advantage.