AI Solution Architecture

Runtime Decision Matrix

Use this template when choosing how to serve a model.

Candidate Runtimes

| Runtime | Best Fit | Watch For |
| --- | --- | --- |
| Hosted API | Fast product validation and managed operations | Provider lock-in, cost, data policy, rate limits |
| Transformers | Compatibility, experimentation, model API baseline | Serving efficiency, memory, throughput |
| vLLM | High-throughput GPU serving and OpenAI-compatible APIs | GPU memory, scheduler behavior, model support |
| llama.cpp | Local, edge, CPU/GPU hybrid, quantized inference | Quantization quality, conversion, context limits |
| Hybrid Gateway | Multi-provider routing and fallback | Routing policy, observability, consistency |

Requirements

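| Requirement | Target | Hard / Soft | Notes |
| --- | --- | --- | --- |
| p50 latency | | | |
| p95 latency | | | |
| Concurrent users | | | |
| Tokens per second | | | |
| Context length | | | |
| Streaming | | | |
| Deployment target | | | |
| Cost ceiling | | | |
| Data policy | | | |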
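
Once the table is filled in, the hard/soft distinction can be checked mechanically. The sketch below is one possible encoding, not a prescribed one: the `Requirement` class, the `check_requirements` helper, the metric names, and every target value are hypothetical placeholders. It assumes a hard requirement disqualifies a runtime when unmet, while a soft requirement is only recorded as a trade-off.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Requirement:
    name: str
    target: float
    # "max": measured value must be <= target (latency, cost).
    # "min": measured value must be >= target (throughput, context length).
    direction: Literal["max", "min"]
    hard: bool


def check_requirements(requirements, measured):
    """Return (hard_failures, soft_failures) as lists of requirement names."""
    hard_failures, soft_failures = [], []
    for req in requirements:
        value = measured.get(req.name)
        if value is None:
            ok = False  # a missing measurement counts as unmet
        elif req.direction == "max":
            ok = value <= req.target
        else:
            ok = value >= req.target
        if not ok:
            (hard_failures if req.hard else soft_failures).append(req.name)
    return hard_failures, soft_failures


# Hypothetical targets and measurements, for illustration only.
REQUIREMENTS = [
    Requirement("p95_latency_ms", 1500.0, "max", hard=True),
    Requirement("tokens_per_second", 40.0, "min", hard=False),
    Requirement("context_length", 8192.0, "min", hard=True),
]
measured = {
    "p95_latency_ms": 1200.0,
    "tokens_per_second": 35.0,
    "context_length": 16384.0,
}

hard, soft = check_requirements(REQUIREMENTS, measured)
print("hard failures:", hard)
print("soft failures:", soft)
```

A runtime with any hard failure would drop out before scoring; soft failures can go in the Notes column.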

Evaluation Matrix

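| Criterion | Weight | Hosted API | Transformers | vLLM | llama.cpp | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Latency | | | | | | |
| Throughput | | | | | | |
| Memory efficiency | | | | | | |
| Model compatibility | | | | | | |
| Operational simplicity | | | | | | |
| Observability | | | | | | |
| Security posture | | | | | | |
| Cost | | | | | | |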

Serving Architecture

flowchart LR
    Client[Application or agent] --> Gateway[Provider/runtime gateway]
    Gateway --> Admission[Admission control]
    Admission --> Scheduler[Batching and scheduling]
    Scheduler --> Model[Model runtime]
    Model --> Stream[Streaming response]
    Gateway --> Metrics[Metrics, traces, logs]
    Metrics --> Gate[Release and rollback gate]
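
The admission-control stage in the diagram can be as simple as a bounded in-flight counter. The sketch below is a minimal illustration under that assumption. `AdmissionController`, `handle`, and the `run_model` callable are hypothetical names, not part of any runtime listed above, and real gateways often add queueing, per-tenant quotas, or token-based limits.

```python
import threading


class AdmissionController:
    """Bounded in-flight counter: admit while below the limit, reject otherwise."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller should shed load or fall back
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(controller, run_model, prompt):
    """Run one request through admission control, then the model runtime."""
    if not controller.try_admit():
        return "rejected"
    try:
        return run_model(prompt)
    finally:
        controller.release()  # always free the slot, even if the runtime raises
```

Rejecting early keeps queue time bounded, so overload shows up as an explicit error rate rather than as unbounded tail latency.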

Promotion Gate

Scoring Guidance

Score each runtime against the real workload rather than against generic popularity. A hosted API can be the best decision for early product validation when managed reliability and fast iteration matter more than low-level control. vLLM can be the best decision when GPU throughput, continuous batching, and OpenAI-compatible serving are important. Transformers can be the best baseline for experimentation and model compatibility. llama.cpp can be strong for local, edge, constrained, or quantized deployments. A hybrid gateway can be valuable when the system needs routing, fallback, policy enforcement, or provider abstraction.

Weights should reflect the product stage. During discovery, operational simplicity and iteration speed may dominate. During scale-up, throughput, cost per token, latency distribution, and quota planning become more important. For regulated systems, data policy, auditability, provider contract terms, and deployment residency can outweigh raw speed.
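
The effect of stage-dependent weights can be sketched as a normalized weighted sum over the matrix. Everything numeric below is invented for illustration and is not a measurement or a recommendation: the 1 to 5 scores, both weight sets, and the helper names `weighted_score` and `rank` are assumptions.

```python
def weighted_score(weights, scores):
    """Weighted average of per-criterion scores, normalized by total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights) / total


def rank(weights, scores_by_runtime):
    """Return (runtime, score) pairs, highest score first."""
    scored = [(name, weighted_score(weights, s)) for name, s in scores_by_runtime.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder scores on a 1-5 scale; replace with values from your own benchmarks.
scores = {
    "hosted_api": {"operational_simplicity": 5, "latency": 3, "throughput": 3, "cost": 2},
    "vllm": {"operational_simplicity": 2, "latency": 4, "throughput": 5, "cost": 4},
}

# Placeholder weights for two product stages.
weights_discovery = {"operational_simplicity": 5, "latency": 2, "throughput": 1, "cost": 2}
weights_scale_up = {"operational_simplicity": 1, "latency": 3, "throughput": 5, "cost": 4}

print("discovery:", rank(weights_discovery, scores))
print("scale-up:", rank(weights_scale_up, scores))
```

With these placeholder numbers the same scores rank the hosted API first under discovery weights and vLLM first under scale-up weights, which is why the weights need to be fixed before the scores are filled in.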

Benchmark Plan

Run benchmarks with production-shaped prompts. Include short requests, long-context requests, retrieval-augmented prompts, streaming responses, tool-call prompts, and worst-case completion lengths. Capture cold start behavior, steady-state throughput, memory pressure, queue time, token generation rate, error rate, and retry behavior. If the runtime is GPU-based, record GPU model, memory size, tensor parallel settings, batch configuration, context length, quantization format, and scheduler parameters.
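
Raw per-request samples can be reduced to the metrics above in a few lines. This is a minimal sketch that assumes each request is recorded as a `RequestSample` (a hypothetical record, not a format emitted by any particular runtime) and that uses the nearest-rank method for percentiles. The token rate here is output tokens divided by non-queue time summed over requests, which approximates per-stream generation speed rather than aggregate system throughput.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    latency_s: float    # end-to-end latency, including queue time
    queue_s: float      # time spent waiting before generation started
    output_tokens: int  # tokens generated for this request
    ok: bool            # False for errors and timeouts


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Summarize latency, queue time, token rate, and error rate."""
    ok = [s for s in samples if s.ok]
    if not ok:
        raise ValueError("no successful requests to summarize")
    latencies = [s.latency_s for s in ok]
    generation_s = sum(s.latency_s - s.queue_s for s in ok)
    tokens = sum(s.output_tokens for s in ok)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "mean_queue_s": sum(s.queue_s for s in ok) / len(ok),
        "tokens_per_second": tokens / generation_s if generation_s > 0 else 0.0,
        "error_rate": 1 - len(ok) / len(samples),
    }
```

Running `summarize` separately for each prompt class (short, long-context, retrieval-augmented, tool-call) keeps a fast class from hiding a slow one in the combined percentiles.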

Do not promote a runtime from synthetic throughput alone. A serving decision is only defensible when quality, latency, cost, observability, security, and rollback have all been checked. The winner should be the runtime that best satisfies the operating envelope, not the runtime with the most impressive isolated benchmark.
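
The rule above can be made explicit as an all-checks-required gate. The sketch below is one way to express it. `REQUIRED_CHECKS` mirrors the six areas named in the paragraph, while `promotion_decision` and the shape of the `results` mapping are assumptions made for illustration.

```python
REQUIRED_CHECKS = ("quality", "latency", "cost", "observability", "security", "rollback")


def promotion_decision(results):
    """Return (promote, blockers).

    `results` maps a check name to True (passed) or False (failed).
    A check that is absent or not explicitly True blocks promotion,
    so an unchecked area can never be treated as a pass.
    """
    blockers = [check for check in REQUIRED_CHECKS if results.get(check) is not True]
    return (not blockers, blockers)


# A strong throughput number is not a required check on its own and unlocks nothing.
print(promotion_decision({"throughput": True}))

# Every area checked and passing.
print(promotion_decision({check: True for check in REQUIRED_CHECKS}))
```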