llama.cpp Architecture

Source snapshot: github-repos/02-model-serving-inference/llama.cpp at bfb4308 (model : support granite multilingual embeddings R2 ... (#22716), tag b9481). This document is grounded in the repository files present in that snapshot.

Executive Summary

llama.cpp is a native C/C++ inference stack for running large language models with minimal setup across local machines, edge devices, and cloud instances. The README summarizes the goal as "LLM inference in C/C++" and emphasizes low dependency count, Apple Silicon support, CPU SIMD paths, many quantization formats, CUDA/HIP/Vulkan/SYCL/OpenCL/Metal and other backends, and CPU/GPU hybrid inference.

The core product is the llama library, whose public C API is in include/llama.h and whose C++ RAII helpers are in include/llama-cpp.h. Around that core, the repository includes model conversion to GGUF, quantization tools, a command-line client, an OpenAI-compatible HTTP server, benchmarking tools, multimodal support, tests, and the ggml tensor/runtime backend library.

For solution architects, llama.cpp is most valuable when the deployment target needs portability, a small operational footprint, local or edge execution, strong quantized-model support, or native integration. It can run as a local CLI, an embedded library, a Docker container, a web server, a mobile demo, or a multi-GPU service where the chosen backend supports it. Its tradeoff is that model support, runtime features, and hardware acceleration are tightly coupled to GGUF metadata, graph construction code, backend kernels, and build flags.

Problem Solved

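Model serving often assumes a Python runtime, GPU-heavy deployment, and a managed serving stack. llama.cpp addresses a different operating model: native C/C++ inference with few dependencies, quantized models, and CPU/GPU hybrid execution on local machines, edge devices, and cloud instances.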

Repository anchors include include/llama.h, src/llama.cpp, src/llama-context.cpp, src/llama-model.cpp, src/llama-kv-cache.cpp, tools/server/*, tools/cli/*, tools/quantize/*, convert_hf_to_gguf.py, gguf-py/gguf/*, ggml/src/*, and tests/*.

AI Stack Role

llama.cpp is an inference runtime and model-serving toolkit. It does not primarily define models for training like Transformers, and it is not a Python-first high-throughput server like vLLM. Its role is closer to a portable native execution layer that loads GGUF models and runs them through ggml backends.

Source Tree Map

| Path | Role |
| --- | --- |
| README.md | Project overview, quick start, supported model families, hardware support, install/build links. |
| include/llama.h | Public C API: model/context params, backend init/free, model loading, llama_batch, llama_encode, llama_decode, samplers, embeddings, KV operations. |
| include/llama-cpp.h | C++ RAII helpers for llama_model, llama_context, and llama_sampler. |
| src/llama.cpp | Core exported implementation glue for public API. |
| src/llama-model.cpp, src/llama-model.h | Model representation, architecture-specific graph construction, RoPE/model logic. |
| src/llama-model-loader.cpp | GGUF model loading and metadata/tensor handling. |
| src/llama-context.cpp | Runtime context, decode/encode execution, output extraction, embeddings, logits, performance behavior. |
| src/llama-batch.cpp | Batch allocation and splitting into micro-batches. |
| src/llama-kv-cache*.cpp, src/llama-memory*.cpp | KV-cache and memory implementations, including hybrid and recurrent paths. |
| src/llama-sampler.cpp | Sampling chain and token selection. |
| src/llama-vocab.cpp, src/unicode*.cpp | Tokenization, vocabulary, Unicode handling. |
| src/llama-adapter.cpp | LoRA/control-vector adapter loading and backend buffer handling. |
| common/* | Shared utilities for CLI/tools/server: args, sampling, chat templates, HF cache/download, logging, JSON/grammar helpers, speculative decoding. |
| tools/server/* | HTTP server based on httplib and nlohmann::json; OpenAI-compatible, Anthropic-compatible, web UI, slots, queue, model routing. |
| tools/cli/*, tools/completion/* | Main interactive CLI and completion utilities. |
| tools/quantize/*, tools/imatrix/*, tools/gguf-split/* | Quantization, importance matrix, GGUF splitting, and model tooling. |
| tools/llama-bench/*, tools/perplexity/* | Benchmarking and quality/perplexity utilities. |
| tools/mtmd/* | Multimodal library and model-specific encoder/projector support. |
| examples/* | Minimal library usage, batched inference, embeddings, retrieval, speculative decoding, Android/Swift demos, training examples. |
| ggml/include, ggml/src | Tensor library, graph execution, backend registry, CPU/CUDA/HIP/Metal/Vulkan/SYCL/OpenCL/RPC and other backends. |
| gguf-py/gguf/* | Python GGUF reader/writer, constants, quant helpers, metadata utilities, tensor mapping. |
| conversion/*, convert_hf_to_gguf.py, convert_lora_to_gguf.py | Model and adapter conversion from HF/PyTorch formats to GGUF. |
| docs/* | Build, install, Docker, multi-GPU, function calling, multimodal, speculative, backend, and model-development guides. |
| tests/* | C/C++ and Python tests for backends, GGUF, quantization, tokenizers, chat templates, grammar, thread safety, model load cancel, server tests. |
| CMakeLists.txt, CMakePresets.json, Makefile | Native build system and presets. |
| pyproject.toml | Python scripts package metadata; scripts like llama-convert-hf-to-gguf; dependencies for conversion tooling. |

Core Concepts

GGUF. GGUF is the single-file model format used by llama.cpp and ggml. The conversion code reads original model configs, tokenizer data, tensor names, and tensor values, then writes standardized metadata and tensors. gguf-py/gguf/constants.py and tensor_mapping.py are central to this contract.
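To make the "metadata and tensors" contract concrete, the sketch below parses only the fixed-size prefix of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). The function name `parse_gguf_header` and the sample values are illustrative and are not part of gguf-py.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 + uint64 + uint64


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF prefix (assumes little-endian, version >= 2)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for the GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic")
    # "<" disables padding: uint32 version, uint64 tensor count, uint64 KV count
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }


# Synthetic header bytes for illustration; the counts are made up.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(sample)
print(header)
```

For a real file, the same function could be fed the first 24 bytes read from disk. The metadata key/value pairs and tensor descriptors that follow the header are not handled here.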

libllama API. include/llama.h is the external contract for embedding llama.cpp into other applications. It exposes backend initialization, model/context creation, batches, encode/decode, samplers, embeddings, and metadata APIs.

ggml graph execution. llama.cpp builds a graph for the model architecture and delegates tensor operations to ggml. Backend implementations under ggml/src/ggml-cpu, ggml-cuda, ggml-metal, ggml-vulkan, ggml-sycl, and others execute the graph.

Context. A llama_context owns runtime state for inference: KV cache, backend scheduler, logits/embeddings buffers, batch handling, and decode state.

Batch and micro-batch. llama_batch can hold one or many sequences. src/llama-batch.cpp splits work into micro-batches (ubatch) to fit execution and memory constraints.
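The idea of cutting a batch into micro-batches can be illustrated with plain chunking. This is a deliberately simplified sketch: the splitting in src/llama-batch.cpp also has to respect sequence IDs, positions, and output flags, none of which are modeled here. The name `split_into_ubatches` is hypothetical.

```python
from typing import List


def split_into_ubatches(tokens: List[int], n_ubatch: int) -> List[List[int]]:
    """Cut a flat token list into consecutive chunks of at most n_ubatch tokens."""
    if n_ubatch <= 0:
        raise ValueError("n_ubatch must be positive")
    return [tokens[i:i + n_ubatch] for i in range(0, len(tokens), n_ubatch)]


# A 10-token prompt with a micro-batch size of 4 (illustrative values).
prompt = list(range(10))
ubatches = split_into_ubatches(prompt, 4)
print([len(u) for u in ubatches])  # [4, 4, 2]
```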

KV cache and memory. src/llama-kv-cache.cpp, llama-memory.cpp, llama-memory-hybrid.cpp, llama-memory-recurrent.cpp, and related files manage context memory. Server flags such as --ctx-size, --cache-type-k, --cache-type-v, --kv-offload, --kv-unified, --cache-prompt, and slot controls expose this behavior.
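A back-of-the-envelope estimate helps when choosing --ctx-size and the cache types. The sketch below uses the standard attention formula (one K and one V tensor per layer, each holding `n_ctx * n_head_kv * head_dim` elements). It ignores sliding-window, recurrent, and hybrid memory, as well as padding and backend alignment, so treat the result as a rough lower bound. The model dimensions are hypothetical, and the q8_0 figure assumes 34 bytes per block of 32 elements.

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_head_kv: int, head_dim: int,
                   bytes_per_elem: float = 2.0) -> int:
    """Rough KV cache size: K and V, per layer, per position, per KV head."""
    elems_per_tensor = n_layer * n_ctx * n_head_kv * head_dim
    return int(2 * elems_per_tensor * bytes_per_elem)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 4096-token context.
f16_bytes = kv_cache_bytes(32, 4096, 8, 128, bytes_per_elem=2.0)
q8_bytes = kv_cache_bytes(32, 4096, 8, 128, bytes_per_elem=34 / 32)

print(f"f16 cache:  {f16_bytes / 2**20:.0f} MiB")   # 512 MiB
print(f"q8_0 cache: {q8_bytes / 2**20:.0f} MiB")    # 272 MiB
```

Doubling the context size doubles the estimate, which is why context length and slot count usually dominate memory planning once the weights are loaded.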

Quantization. The project supports many quantization types, which reduce memory use and make larger models feasible on local hardware. Runtime quantized ops live in ggml backends, while tools such as tools/quantize and scripts such as convert_hf_to_gguf.py produce quantized assets.
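The sketch below shows the general idea of block quantization in the style of an 8-bit format: each block of 32 weights shares one scale, and each weight is stored as a signed 8-bit integer. It is a simplified illustration, not the ggml kernel: the scale is kept as a Python float rather than a half-precision value, and the function names are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per block in this sketch


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to one shared scale plus int8-range integers."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * len(values)
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Reconstruct approximate floats from the scale and integers."""
    return [scale * q for q in quants]


# Illustrative block of weights spread over [-1, 1).
weights = [(i - 16) / 16.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_abs_error={max_err:.6f}")
```

With a 2-byte scale and 32 one-byte integers, such a block would occupy 34 bytes for 32 weights, or about 8.5 bits per weight. The rounding error per weight is bounded by roughly half the scale.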

Backend selection. Build flags and runtime flags determine whether CPU, Metal, CUDA, HIP, Vulkan, SYCL, OpenCL, RPC, or other paths are available. docs/build.md and docs/multi-gpu.md document operational choices.

Server slots and continuous batching. tools/server supports parallel decoding, multi-user serving, continuous batching, slots monitoring, prompt caching, metrics, and model routing.
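Continuous batching can be pictured as a fixed set of slots that are refilled as soon as a request finishes, so each decode step batches whatever is currently active. The toy scheduler below models only that refill behavior. It does not model prompt processing, KV cache reuse, or anything specific to the tools/server implementation, and all names in it are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Request:
    name: str
    tokens_left: int  # tokens still to generate


def run_slots(requests: List[Request], n_slots: int) -> List[List[str]]:
    """Return, for each decode step, the names of the requests batched together."""
    pending: Deque[Request] = deque(requests)
    slots: List[Optional[Request]] = [None] * n_slots
    steps: List[List[str]] = []
    while pending or any(s is not None for s in slots):
        # Refill free slots before every step instead of waiting for the batch to drain.
        for i in range(n_slots):
            if slots[i] is None and pending:
                slots[i] = pending.popleft()
        # One decode step produces one token for every active slot.
        steps.append([r.name for r in slots if r is not None])
        for i, r in enumerate(slots):
            if r is not None:
                r.tokens_left -= 1
                if r.tokens_left <= 0:
                    slots[i] = None  # slot becomes available for the next request
    return steps


trace = run_slots([Request("a", 3), Request("b", 1), Request("c", 2)], n_slots=2)
for step, names in enumerate(trace, start=1):
    print(step, names)
```

In this run, request "c" joins the batch on the second step, right after "b" finishes, while "a" is still generating.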

Component/System Diagram

flowchart LR
    User[CLI, HTTP client, native app] --> Tools[tools/cli, tools/server, examples]
    Tools --> Common[common\nargs, chat, sampling, HF cache, grammar]
    Tools --> API[include/llama.h\nlibllama C API]
    API --> Context[src/llama-context.cpp\nruntime context]
    Context --> Model[src/llama-model.cpp\narchitecture graph]
    Context --> KV[src/llama-kv-cache + llama-memory\ncache and state]
    Context --> Sampler[src/llama-sampler.cpp\nsampling chain]
    Model --> Loader[src/llama-model-loader.cpp\nGGUF tensors and metadata]
    Loader --> GGUF[gguf-py + ggml gguf\nsingle-file model format]
    Context --> GGML[ggml graph scheduler]
    GGML --> Backends[CPU, CUDA, HIP, Metal,\nVulkan, SYCL, OpenCL, RPC]
    Conversion[convert_hf_to_gguf.py\nconversion/*] --> GGUF
    Tests[tests + server tests] --> API

Internal Architecture

llama.cpp has a native runtime core surrounded by tools.

Public API layer. include/llama.h is the stable interface consumers compile against. Applications call llama_backend_init, load a model, create a context, build batches, call llama_decode or llama_encode, and inspect logits, embeddings, or tokens. include/llama-cpp.h adds C++ ownership helpers.

Model and metadata layer. src/llama-model-loader.cpp reads GGUF files, tensor metadata, architecture-specific keys, and weight data. src/llama-arch.h and src/llama-arch.cpp define architecture constants, tensor names, and metadata mappings that must align with gguf-py/gguf/constants.py.

Graph construction layer. src/llama-model.cpp builds ggml graphs for supported architectures. The development guide docs/development/HOWTO-add-model.md says new models require conversion support, architecture metadata, graph implementation, and optional multimodal encoder support.

Runtime context layer. src/llama-context.cpp holds execution state, manages decode/encode calls, extracts logits/embeddings, applies output handling, and calls into the backend scheduler.

Memory/KV layer. KV cache and memory are separated into several implementation files so recurrent, hybrid, sliding-window, and standard attention models can be handled.

Shared tool layer. common/arg.cpp, common/common.cpp, common/sampling.cpp, common/chat.cpp, common/hf-cache.cpp, and common/download.cpp are reused by CLI, server, and tools. This avoids each binary reimplementing argument parsing, sampling, chat templates, Hugging Face downloads, or logging.

Backend layer. ggml owns low-level tensor execution and backend registration. ggml/src/ggml-backend-reg.cpp, ggml-backend.cpp, and backend directories provide implementation-specific ops.

End-to-End Runtime Flow

sequenceDiagram
    participant C as User/client
    participant T as llama-cli or llama-server
    participant A as common args/chat/sampling
    participant L as libllama API
    participant M as model loader
    participant X as llama_context
    participant G as ggml backend scheduler
    participant S as sampler
    C->>T: prompt, chat request, or API call
    T->>A: parse flags, chat template, sampling config
    A->>L: llama_backend_init + load model/context
    L->>M: read GGUF metadata and tensors
    M-->>L: llama_model
    L-->>X: context with KV cache and backend scheduler
    loop generation
        T->>X: build llama_batch and call llama_decode
        X->>G: execute model graph on selected backend
        G-->>X: logits / embeddings / updated KV
        X-->>S: logits
        S-->>T: next token
        T-->>C: token text, stream chunk, or JSON delta
    end

Runtime and Data Flow

  1. Model acquisition. A user passes -m model.gguf, -hf user/repo, --model-url, or Docker model arguments. common/hf-cache.cpp, common/download.cpp, and common_get_model_endpoint() support model cache and endpoint selection.
  2. Model load. llama_model_loader reads GGUF metadata and tensors, often using memory mapping unless disabled by flags such as --no-mmap.
  3. Context initialization. llama_context is created with context parameters, backend devices, KV cache settings, Flash Attention choice, thread settings, and optional offload.
  4. Prompt processing. CLI/server code tokenizes input, applies chat templates from common/chat.cpp, handles grammars/JSON schema, and forms llama_batch.
  5. Graph execution. The architecture-specific graph is constructed and run through ggml. The backend scheduler places tensors and ops on CPU/GPU backends according to build/runtime availability.
  6. KV update. Attention state is stored in the cache and may be offloaded, quantized, unified, shifted, saved, or restored depending on flags and server slot state.
  7. Sampling. src/llama-sampler.cpp and common/sampling.cpp apply sampler chains such as penalties, top-k, top-p, min-p, temperature, Mirostat, grammar constraints, and logit bias.
  8. Output. CLI prints tokens; server streams or returns JSON through tools/server/server-http.cpp, server-task.cpp, server-context.cpp, and route-specific code.
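The sampling step in the flow above can be sketched as a chain of small filters over the logits followed by a random draw. The example covers only top-k, top-p, and temperature. The order, default values, and function names are illustrative assumptions and do not reproduce the chain that src/llama-sampler.cpp or common/sampling.cpp actually builds.

```python
import math
import random
from typing import Dict


def softmax(logits: Dict[int, float]) -> Dict[int, float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits.values())
    exps = {tok: math.exp(l - m) for tok, l in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(logits: Dict[int, float], k: int) -> Dict[int, float]:
    """Keep the k highest-logit tokens."""
    return dict(sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k])


def top_p(logits: Dict[int, float], p: float) -> Dict[int, float]:
    """Keep the smallest high-probability prefix whose mass reaches p."""
    probs = softmax(logits)
    kept: Dict[int, float] = {}
    cumulative = 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = logits[tok]
        cumulative += pr
        if cumulative >= p:
            break
    return kept


def apply_temperature(logits: Dict[int, float], temp: float) -> Dict[int, float]:
    """Divide logits by the temperature; values below 1 sharpen the distribution."""
    return {tok: l / temp for tok, l in logits.items()}


def sample(logits: Dict[int, float], rng: random.Random,
           k: int = 40, p: float = 0.95, temp: float = 0.8) -> int:
    """Run the chain top-k -> top-p -> temperature, then draw one token."""
    probs = softmax(apply_temperature(top_p(top_k(logits, k), p), temp))
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Four-token toy vocabulary with made-up logits.
logits = {0: 2.0, 1: 1.0, 2: 0.1, 3: -1.0}
survivors = top_p(top_k(logits, 3), 0.9)
token = sample(logits, random.Random(0), k=3, p=0.9, temp=0.8)
print(sorted(survivors), token)
```

Here top-k drops token 3, top-p then drops token 2, and the final draw is made between tokens 0 and 1.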

Deployment and Operations Topology

flowchart TB
    subgraph Local
        CLI[llama-cli / llama-completion]
        Lib[Native application embedding libllama]
    end
    subgraph ServerNode["llama-server node"]
        HTTP[tools/server\nhttplib HTTP server]
        Queue[server queue and slots]
        Contexts[llama_context instances]
        UI[Web UI / static assets]
    end
    subgraph Models
        HF[Hugging Face repo]
        GGUF[Local GGUF files]
        LoRA[LoRA GGUF adapters]
        MMProj[Multimodal projector]
    end
    subgraph Runtime
        GGML[ggml backend registry]
        CPU[CPU SIMD / BLAS]
        GPU[CUDA/HIP/Metal/Vulkan/SYCL/OpenCL]
        RPC[RPC backend]
    end
    subgraph Ops
        Docker[GHCR images\nfull/light/server variants]
        Metrics[Prometheus metrics endpoint]
        Bench[llama-bench / perplexity]
    end
    CLI --> GGUF
    Lib --> GGUF
    HTTP --> Queue --> Contexts --> GGML
    HTTP --> UI
    HF --> GGUF
    LoRA --> Contexts
    MMProj --> Contexts
    GGML --> CPU
    GGML --> GPU
    GGML --> RPC
    Docker --> HTTP
    Metrics --> HTTP
    Bench --> GGML

Deployment patterns:

Lifecycle, Decisions, and Module Dependencies

stateDiagram-v2
    [*] --> Build
    Build --> SelectBackend: CMake flags decide available backends
    SelectBackend --> AcquireModel: local GGUF, HF cache, URL, Docker volume
    AcquireModel --> LoadGGUF: metadata and tensors
    LoadGGUF --> InitContext: context params, KV cache, devices
    InitContext --> PromptReady: tokenize and apply chat template
    PromptReady --> DecodeLoop
    DecodeLoop --> DecodeLoop: batch -> graph -> logits -> sample
    DecodeLoop --> Finished: EOS, max tokens, reverse prompt, stop
    DecodeLoop --> Error: OOM, invalid GGUF, backend unsupported
    Finished --> [*]
    Error --> [*]

flowchart LR
    Convert[convert_hf_to_gguf.py\nconversion/*] --> GGUFConstants[gguf-py/gguf/constants.py]
    GGUFConstants --> Arch[src/llama-arch.h/.cpp]
    Arch --> Loader[src/llama-model-loader.cpp]
    Loader --> Model[src/llama-model.cpp]
    Model --> Context[src/llama-context.cpp]
    Context --> Batch[src/llama-batch.cpp]
    Context --> Memory[src/llama-kv-cache*.cpp\nllama-memory*.cpp]
    Context --> GGML[ggml/src/ggml-backend*.cpp]
    GGML --> BackendDirs[ggml/src/ggml-cpu\ncuda, metal, vulkan, sycl]

Extension Points

Integrations

Configuration, Deployment, and Ops

llama.cpp configuration is split between build-time and runtime.

Build-time: CMake options select backend availability. docs/build.md covers CPU, BLAS, Metal, SYCL, CUDA, MUSA, HIP, Vulkan, CANN, ZenDNN, KleidiAI, OpenCL, Android, and OpenVINO. Build flags determine whether a binary can use a backend at all.

Runtime: common/arg.cpp centralizes flags for threads, CPU affinity, context size, batch/micro-batch size, Flash Attention, RoPE scaling, KV cache types, mmap/mlock/direct I/O, devices, GPU layers, split mode, LoRA, model source, logging, sampling, grammar, server host/port, API key, TLS, metrics, slots, props, prompt cache, and model routing.
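Because runtime behavior is driven by command-line flags, deployments often keep the flag set in configuration and assemble the command from it. The helper below is a minimal sketch of that pattern. The flag names are ones mentioned in this document, the values are hypothetical, and `build_server_argv` is an illustrative helper, not part of llama.cpp. It does not validate flags against common/arg.cpp.

```python
from typing import Dict, List, Union

FlagValue = Union[bool, int, str]


def build_server_argv(model_path: str, options: Dict[str, FlagValue]) -> List[str]:
    """Turn a flag-to-value mapping into an argv list for llama-server."""
    argv = ["llama-server", "-m", model_path]
    for flag, value in options.items():
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(value, bool):
            if value:
                argv.append(flag)  # switch-style flag, present only when True
        else:
            argv.extend([flag, str(value)])
    return argv


# Hypothetical configuration; tune the values to the target hardware.
argv = build_server_argv(
    "model.gguf",
    {
        "--ctx-size": 8192,
        "--n-gpu-layers": 32,
        "--cache-type-k": "q8_0",
        "--metrics": True,
    },
)
print(" ".join(argv))
```

The resulting list can be handed to a process launcher such as `subprocess.run(argv)`. Keeping it as a list rather than a single string avoids shell quoting problems.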

Operational practices:

Observability, Testing, Evaluation, and Failure Modes

Observability and benchmarking:

Testing anchors:

Common failure modes:

Security and Governance Risks

Reading Guide

  1. Start with README.md for goals, quick start, and supported model/hardware claims.
  2. Read docs/build.md, docs/docker.md, and docs/multi-gpu.md to understand deployment constraints.
  3. Read include/llama.h before reading internal implementation.
  4. Trace model loading through src/llama-model-loader.cpp and architecture constants in src/llama-arch.*.
  5. Study src/llama-context.cpp, src/llama-batch.cpp, and src/llama-kv-cache.cpp for runtime behavior.
  6. Read src/llama-model.cpp for graph construction.
  7. Read ggml/include/ggml.h, ggml/include/ggml-backend.h, and backend directories for execution.
  8. Read tools/server/README.md and tools/server/*.cpp for service operation.
  9. Read docs/development/HOWTO-add-model.md before adding model support.
  10. Use tests/* to understand expected behavior before modifying internals.

Learning Path

  1. Build a CPU-only mental model with examples/simple/simple.cpp.
  2. Follow tools/cli/main.cpp into common argument parsing and libllama.
  3. Inspect how llama_batch is created and decoded.
  4. Read the sampling chain in src/llama-sampler.cpp and common/sampling.cpp.
  5. Learn the server architecture from tools/server/server.cpp and the queue, context, task, HTTP, and model files.
  6. Study GGUF conversion and constants before touching architecture support.
  7. Compare backend implementations only after understanding the ggml backend API.
  8. Use benchmarks and tests to validate any performance or model-support change.

Production Readiness And Native Runtime Gate

llama.cpp readiness is split across build-time, model-time, and runtime gates. The important source anchors are include/llama.h, src/llama-model-loader.cpp, src/llama-context.cpp, src/llama-kv-cache*.cpp, src/llama-sampler.cpp, common/arg.cpp, tools/server/*, ggml/src/*, convert_hf_to_gguf.py, gguf-py/gguf/*, and docs/build.md.

| Gate | What to verify |
| --- | --- |
| Build | Binary was compiled with the intended backend: CPU BLAS, CUDA, HIP, Metal, Vulkan, SYCL, OpenCL, RPC, or vendor-specific options. |
| GGUF provenance | Source model, conversion script version, tokenizer metadata, quantization type, and license are recorded. |
| Memory fit | --ctx-size, --batch-size, --ubatch-size, KV cache type, --n-gpu-layers, slots, and split mode fit RAM/VRAM. |
| Server exposure | llama-server is behind auth, TLS/reverse proxy, rate limits, and only required routes/features are enabled. |
| Tool surface | Experimental server tools, local media paths, file access, shell-like actions, and static UI are disabled unless explicitly governed. |
| Observability | --metrics, logs, llama-bench, perplexity checks, and model-load diagnostics are captured before canary traffic. |

flowchart LR
    Build[CMake build flags] --> Backend[ggml backend registry]
    Backend --> Binary[llama-cli, llama-server, libllama]
    Model[HF or local model] --> Convert[convert_hf_to_gguf.py and gguf-py]
    Convert --> GGUF[GGUF metadata and tensors]
    GGUF --> Loader[src/llama-model-loader.cpp]
    Binary --> Loader
    Loader --> Context[src/llama-context.cpp]
    Context --> KV[src/llama-kv-cache and memory files]
    Context --> Sampler[src/llama-sampler.cpp]
    Context --> Server[tools/server queue, slots, routes]
    Server --> Metrics[Prometheus metrics and logs]
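The GGUF provenance gate is easier to enforce when the recorded facts live in a small manifest next to the model file. The sketch below is one possible shape for such a record. The field names follow the gate description, while the SHA-256 digest, the manifest layout, and every value shown are assumptions added for illustration.

```python
import hashlib
import io
import json
from dataclasses import asdict, dataclass
from typing import BinaryIO


@dataclass
class GGUFProvenance:
    source_model: str
    conversion_script_version: str
    tokenizer: str
    quantization_type: str
    license: str
    sha256: str  # assumption: a content hash ties the record to one exact file


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a file-like object in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for an opened model file; a real run would use open(path, "rb").
fake_file = io.BytesIO(b"GGUF" + b"\x00" * 64)

record = GGUFProvenance(
    source_model="example-org/example-model",          # hypothetical
    conversion_script_version="convert_hf_to_gguf.py @ <commit>",
    tokenizer="as shipped with the source model",      # hypothetical
    quantization_type="Q4_K_M",
    license="apache-2.0",                              # hypothetical
    sha256=sha256_of_stream(fake_file),
)
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```

Recomputing the digest at deploy time and comparing it with the manifest gives a cheap check that the file being served is the one that was reviewed.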

Failure Isolation Map

Native runtimes fail differently from Python serving stacks. A single symptom such as "slow generation" can be caused by build flags, model format, backend placement, KV cache settings, sampler configuration, server queue pressure, or client misuse.

flowchart TD
    Symptom[llama.cpp symptom] --> Domain{Domain}
    Domain --> Build[Backend not compiled or wrong binary]
    Domain --> GGUF[Invalid GGUF or tokenizer metadata]
    Domain --> Memory[RAM, VRAM, KV, context, slots]
    Domain --> Backend[Backend op or split-mode issue]
    Domain --> Sampler[Chat template, grammar, sampling]
    Domain --> Server[Queue, route, auth, streaming]
    Domain --> Security[Tools, media path, API exposure]
    Build --> Files1[docs/build.md and CMake files]
    GGUF --> Files2[gguf-py, conversion, llama-model-loader]
    Memory --> Files3[llama-context, llama-kv-cache, common/arg.cpp]
    Backend --> Files4[ggml/src backends and docs/multi-gpu.md]
    Sampler --> Files5[common/chat.cpp, common/sampling.cpp, llama-sampler.cpp]
    Server --> Files6[tools/server server-http, queue, context]
    Security --> Files7[tools/server README and runtime flags]
    Files1 --> Remediate[Rebuild, reconvert, retune, or isolate]
    Files2 --> Remediate
    Files3 --> Remediate
    Files4 --> Remediate
    Files5 --> Remediate
    Files6 --> Remediate
    Files7 --> Remediate

Glossary

| Term | Meaning |
| --- | --- |
| GGUF | Single-file model format containing metadata and tensors for ggml/llama.cpp inference. |
| ggml | Tensor and graph execution library used by llama.cpp. |
| libllama | Native library interface exposed through include/llama.h. |
| llama_context | Runtime state for inference, including KV cache, backend scheduler, logits, and embeddings. |
| llama_batch | Input structure containing tokens/embeddings, positions, sequence IDs, and output flags. |
| KV cache | Attention key/value memory from previous tokens. |
| ubatch | Micro-batch produced from a larger llama_batch. |
| mmap | Memory mapping model files to reduce load overhead and memory copying. |
| mlock | Request to keep model pages in RAM. |
| n-gpu-layers | Runtime option controlling how many layers are offloaded to GPU. |
| split mode | Multi-GPU strategy such as layer or tensor. |
| LoRA adapter | Low-rank adapter loaded to modify model behavior without replacing the base model. |
| imatrix | Importance matrix used to guide quantization choices. |
| mtmd | llama.cpp multimodal library/tool area. |
| slot | Server-side concurrent sequence/context lane. |