
Model & Serving Layer

The model and serving layer decides which model runs inference, where it runs, how requests are routed, and what operational envelope the rest of the stack must live within.

This layer is not a workflow framework. It determines reasoning capability, cost, latency, context window, tool-calling behavior, data boundaries, availability, and operational burden.

What this layer owns

| Concern | Why it matters |
| --- | --- |
| Model capability | Coding, planning, extraction, classification, tool calling, multilingual quality |
| Context window | Whether the agent can hold specs, logs, retrieved context, and diffs |
| Latency | Whether the product feels interactive or batch-oriented |
| Cost | Whether the system can scale beyond demos |
| Data boundary | Whether prompts and outputs can leave your environment |
| Availability | Whether outages are handled through fallback models or queues |
| Serving operations | Whether your team owns GPU capacity, scaling, upgrades, and monitoring |

Provider-hosted vs self-hosted

| Option | Best for | Tradeoff |
| --- | --- | --- |
| Hosted frontier model | hardest reasoning, coding, multimodal, tool use | cost, latency, data boundary concerns |
| Hosted smaller model | classification, extraction, simple routing | weaker reasoning |
| Local/self-hosted model | data control, offline, cost predictability | ops burden, weaker frontier capability |
| Model router | mixed workload optimization | needs policy and observability |

Provider-hosted models are usually the fastest way to get high capability. Self-hosted models become attractive when data boundaries, predictable high-volume workloads, offline operation, or internal platform control matter more than maximum frontier capability.

Model router pattern

```mermaid
flowchart LR
    A[Application or agent harness] --> B[Model router]
    B --> C[Hosted frontier model]
    B --> D[Cheaper hosted model]
    B --> E[Local/self-hosted model]
    E --> F[vLLM / Ollama / TGI]
    B --> G[Cost, latency, safety, context policy]
```

A model router is useful when different tasks need different models:

| Task | Good routing choice |
| --- | --- |
| Architecture reasoning | strongest reasoning model |
| Code generation | strong coding model |
| Classification | cheaper fast model |
| Embedding | embedding-specific model |
| Sensitive local summarization | local or private model |
| Bulk extraction | cheaper model with structured-output checks |

The router should not be just a switch statement. It should log decisions, enforce data policy, track cost, and allow evaluation per route.
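To make the difference from a switch statement concrete, here is a minimal sketch of a router that filters routes by data policy, records every decision, and totals estimated cost per route. The route names, sensitivity labels, and prices are all hypothetical placeholders chosen for illustration, and no real gateway library is used.

```python
from dataclasses import dataclass, field

# Hypothetical sensitivity labels, ordered from least to most restricted.
SENSITIVITY_ORDER = ["public", "internal", "confidential", "regulated"]


@dataclass(frozen=True)
class Route:
    name: str
    max_sensitivity: str      # most sensitive data this route may receive
    usd_per_1k_tokens: float  # illustrative price, not a real quote


@dataclass
class RouteDecision:
    task: str
    route: str
    reason: str
    estimated_cost_usd: float


@dataclass
class ModelRouter:
    routes: dict[str, Route]
    task_preferences: dict[str, list[str]]  # task -> route names, best first
    decisions: list[RouteDecision] = field(default_factory=list)

    def choose(self, task: str, sensitivity: str, tokens: int) -> RouteDecision:
        level = SENSITIVITY_ORDER.index(sensitivity)
        for name in self.task_preferences[task]:
            route = self.routes[name]
            # Data policy: skip routes not cleared for this sensitivity level.
            if SENSITIVITY_ORDER.index(route.max_sensitivity) < level:
                continue
            decision = RouteDecision(
                task=task,
                route=name,
                reason=f"first preferred route cleared for {sensitivity} data",
                estimated_cost_usd=tokens / 1000 * route.usd_per_1k_tokens,
            )
            self.decisions.append(decision)  # audit trail for per-route evals
            return decision
        raise PermissionError(f"no route cleared for {sensitivity} data on {task}")

    def cost_by_route(self) -> dict[str, float]:
        # Sum the estimated cost of every logged decision, grouped by route.
        totals: dict[str, float] = {}
        for d in self.decisions:
            totals[d.route] = totals.get(d.route, 0.0) + d.estimated_cost_usd
        return totals


# Example wiring with made-up routes and prices.
router = ModelRouter(
    routes={
        "hosted-frontier": Route("hosted-frontier", "internal", 0.03),
        "hosted-small": Route("hosted-small", "internal", 0.001),
        "local": Route("local", "regulated", 0.0),
    },
    task_preferences={
        "architecture_reasoning": ["hosted-frontier", "local"],
        "classification": ["hosted-small", "local"],
        "summarization": ["hosted-small", "local"],
    },
)
```

In this sketch, `router.choose("summarization", "confidential", 2000)` skips the hosted route because it is only cleared for internal data and falls through to the local one. The logged `decisions` list is what a per-route evaluation or cost report would read from.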

Local LLM pattern

Local models are valuable when:

  1. The data cannot leave a controlled environment.
  2. The workload is repetitive enough to justify infrastructure.
  3. The task does not require frontier-level reasoning.
  4. The team can operate GPUs, model updates, quantization, and serving metrics.

Local serving is not automatically cheaper. A fair comparison must count GPU utilization, maintenance, latency tuning, evaluation, and incident response.
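A rough back-of-the-envelope calculation shows why utilization dominates the comparison. The sketch below converts an hourly GPU price into an effective cost per million tokens. The function name, its parameters, and every number passed to it are assumptions for illustration, not measurements or vendor prices.

```python
def self_hosted_usd_per_million_tokens(
    gpu_usd_per_hour: float,
    tokens_per_second: float,
    utilization: float,
    ops_overhead_usd_per_hour: float = 0.0,
) -> float:
    """Effective serving cost per million tokens for one GPU.

    utilization is the fraction of each hour the GPU spends on useful
    inference. ops_overhead_usd_per_hour is a placeholder for staffing,
    monitoring, and incident costs amortized per GPU-hour.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    useful_tokens_per_hour = tokens_per_second * 3600 * utilization
    hourly_cost = gpu_usd_per_hour + ops_overhead_usd_per_hour
    return hourly_cost / useful_tokens_per_hour * 1_000_000


# Made-up inputs: a 2.00 USD/hour GPU sustaining 1,000 tokens/s.
busy = self_hosted_usd_per_million_tokens(2.0, 1000.0, utilization=0.5)
idle = self_hosted_usd_per_million_tokens(2.0, 1000.0, utilization=0.25)
```

With these made-up inputs, `busy` is roughly 1.11 USD per million tokens and `idle` is exactly twice that, because the hourly bill stays fixed while useful output halves. Adding a nonzero `ops_overhead_usd_per_hour` raises both figures, which is the part a price-per-token comparison against a hosted model tends to omit.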

Decision matrix

| Constraint | Prefer |
| --- | --- |
| Highest reasoning quality | Hosted frontier model |
| Strict data boundary | Local/self-hosted or private managed deployment |
| Many simple high-volume calls | Smaller hosted model or local batch serving |
| Multiple teams with mixed use cases | LiteLLM-style gateway/router |
| Need offline operation | Local/self-hosted |
| Need fast experimentation | Hosted model first |
| Regulated enterprise deployment | Router plus audit, retention, and approval policy |

Step-by-step adoption guide

  1. List workloads: coding, support, RAG answering, extraction, classification, summarization, embeddings.
  2. Label each workload by risk: public, internal, confidential, regulated.
  3. Define required model capabilities: context length, tool calling, structured output, multilingual support, latency.
  4. Start with one hosted model for the hardest path and one cheaper model for simple paths.
  5. Add a model router only when you have at least two real routing policies.
  6. Add self-hosted serving only after evals prove the local model is good enough.
  7. Log route, model, token count, latency, failure, and cost for every call.
  8. Add fallback policy for outage, rate limit, and quality regression.
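Steps 7 and 8 can be sketched together: a wrapper that tries backends in order and appends one log record per attempt, including failed ones. This is a minimal illustration under stated assumptions. `Backend`, `CallRecord`, and `call_with_fallback` are hypothetical names, the backends are stand-in functions rather than real provider clients, and the sketch only handles outage-style exceptions. Falling back on quality regression would need an evaluation signal that is not modeled here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    route: str
    model: str
    tokens: int
    latency_s: float
    failure: Optional[str]  # exception type name, or None on success
    cost_usd: float


@dataclass
class Backend:
    route: str
    model: str
    usd_per_1k_tokens: float  # illustrative price, not a real quote
    call: Callable[[str], tuple[str, int]]  # returns (text, tokens used)


class AllRoutesFailed(RuntimeError):
    """Raised when every backend in the fallback chain has failed."""


def call_with_fallback(
    prompt: str, backends: list[Backend], log: list[CallRecord]
) -> str:
    for backend in backends:
        start = time.perf_counter()
        try:
            text, tokens = backend.call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, and similar
            log.append(CallRecord(
                backend.route, backend.model, 0,
                time.perf_counter() - start, type(exc).__name__, 0.0,
            ))
            continue  # try the next backend in the chain
        log.append(CallRecord(
            backend.route, backend.model, tokens,
            time.perf_counter() - start, None,
            tokens / 1000 * backend.usd_per_1k_tokens,
        ))
        return text
    raise AllRoutesFailed(f"all {len(backends)} backends failed")


# Stand-in backends that simulate an outage and a healthy fallback.
def flaky_primary(prompt: str) -> tuple[str, int]:
    raise TimeoutError("simulated provider outage")


def steady_secondary(prompt: str) -> tuple[str, int]:
    return f"echo: {prompt}", 12


call_log: list[CallRecord] = []
answer = call_with_fallback(
    "hi",
    [
        Backend("primary", "model-a", 0.03, flaky_primary),
        Backend("secondary", "model-b", 0.001, steady_secondary),
    ],
    call_log,
)
```

After this run, `call_log` holds two records: a failed attempt on the primary route and a successful one on the secondary. Logging the failed attempt matters as much as the success, since that is what makes fallback rates visible per route.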

Failure modes

| Failure mode | Consequence | Prevention |
| --- | --- | --- |
| Choosing a weak model for complex coding | Agent appears "undisciplined" but the model is underpowered | Capability evals before rollout |
| No routing policy | Expensive models handle trivial tasks | Cost-aware routing |
| No data classification | Confidential prompts go to the wrong provider | Data boundary policy |
| Self-hosting too early | Platform team becomes GPU operations team | Prove need with cost and data analysis |
| No fallback | Provider outage becomes product outage | Fallback and queue policy |
| No telemetry | Cost and latency problems are invisible | Observability per model route |
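The "capability evals before rollout" prevention can be as small as a pass-rate gate run against a fixed set of labeled cases. The sketch below is one possible shape, assuming exact-match scoring and a hypothetical 0.9 threshold. Real coding or reasoning evals would usually need richer scoring, such as running tests against generated code. The model here is a stand-in function, not a real client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the model output exactly matches the label."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for c in cases if model(c.prompt).strip() == c.expected)
    return passed / len(cases)


def approve_route(
    model: Callable[[str], str], cases: list[EvalCase], threshold: float = 0.9
) -> bool:
    # Gate: a model is only eligible for a route if it clears the threshold.
    return pass_rate(model, cases) >= threshold


# Stand-in classifier and a tiny made-up eval set.
def keyword_classifier(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "other"


cases = [
    EvalCase("Where is my invoice?", "billing"),
    EvalCase("Invoice total looks wrong", "billing"),
    EvalCase("I was charged twice", "billing"),
    EvalCase("How do I reset my password?", "other"),
]
```

Here the stand-in classifier misses the "charged twice" case, so `pass_rate(keyword_classifier, cases)` is 0.75 and `approve_route` rejects it at the default threshold. The same gate can be rerun per route whenever a model or prompt changes.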

