llama.cpp Architecture

Source snapshot: github-repos/02-model-serving-inference/llama.cpp at commit bfb4308 (model : support granite multilingual embeddings R2 ... (#22716), tag b9481). This document is based on the files and directories present in that snapshot.

Executive summary

llama.cpp is a native C/C++ inference stack for running LLMs with minimal setup on local machines, edge devices, and cloud instances. The README summarizes the goal as "LLM inference in C/C++" and emphasizes few dependencies, Apple Silicon support, CPU SIMD paths, many quantization formats, CUDA/HIP/Vulkan/SYCL/OpenCL/Metal and other backends, and hybrid CPU/GPU inference.

The core product is the llama library, with a public C API in include/llama.h and C++ RAII helpers in include/llama-cpp.h. Around that core sit tooling for converting models to GGUF, quantization, a CLI, an OpenAI-compatible HTTP server, benchmarking, multimodal support, tests, and the ggml tensor/runtime library.

For a solution architect, llama.cpp is valuable when the deployment target needs portability, a small footprint, local/edge execution, strong quantized-model support, or native integration. Depending on the backend, it can run as a local CLI, an embedded library, a Docker container, a web server, a mobile demo, or a multi-GPU service. In exchange, model support, runtime features, and hardware acceleration depend tightly on GGUF metadata, graph construction code, backend kernels, and build flags.

Problem addressed

Many model serving stacks assume by default a Python runtime, large GPUs, and a managed server. llama.cpp serves a different deployment model:

Anchor points in the repo include include/llama.h, src/llama.cpp, src/llama-context.cpp, src/llama-model.cpp, src/llama-kv-cache.cpp, tools/server/*, tools/cli/*, tools/quantize/*, convert_hf_to_gguf.py, gguf-py/gguf/*, ggml/src/*, and tests/*.

Role in the AI stack

llama.cpp is an inference runtime and model serving toolkit. It does not primarily define models for training the way Transformers does, and it is not a Python-first high-throughput server like vLLM. Its role is closer to a portable native execution layer:

Source tree map

| Path | Role |
| --- | --- |
| README.md | Project overview, quick start, supported model families, hardware support, install/build links. |
| include/llama.h | Public C API: model/context params, backend init/free, model loading, llama_batch, llama_encode, llama_decode, samplers, embeddings, KV operations. |
| include/llama-cpp.h | C++ RAII helpers for llama_model, llama_context, llama_sampler. |
| src/llama.cpp | Glue implementation for the exported public API. |
| src/llama-model.cpp, src/llama-model.h | Model representation, per-architecture graph construction, RoPE/model logic. |
| src/llama-model-loader.cpp | Loads GGUF models and handles metadata/tensors. |
| src/llama-context.cpp | Runtime context, decode/encode execution, output extraction, embeddings, logits, performance behavior. |
| src/llama-batch.cpp | Batch allocation and splitting into micro-batches. |
| src/llama-kv-cache*.cpp, src/llama-memory*.cpp | KV-cache and memory implementations, including hybrid and recurrent paths. |
| src/llama-sampler.cpp | Sampling chain and token selection. |
| src/llama-vocab.cpp, src/unicode*.cpp | Tokenization, vocabulary, Unicode. |
| src/llama-adapter.cpp | Loads LoRA/control-vector adapters and handles backend buffers. |
| common/* | Utilities shared by CLI/tools/server: args, sampling, chat templates, HF cache/download, logging, JSON/grammar helpers, speculative decoding. |
| tools/server/* | HTTP server built on httplib and nlohmann::json; OpenAI-compatible, Anthropic-compatible, web UI, slots, queue, model routing. |
| tools/cli/*, tools/completion/* | Interactive CLI and completion utilities. |
| tools/quantize/*, tools/imatrix/*, tools/gguf-split/* | Quantization, importance matrix, GGUF splitting, and model tooling. |
| tools/llama-bench/*, tools/perplexity/* | Benchmarking and quality/perplexity utilities. |
| tools/mtmd/* | Multimodal library and model-specific encoder/projector support. |
| examples/* | Minimal library usage examples, batched inference, embeddings, retrieval, speculative decoding, Android/Swift demos, training examples. |
| ggml/include, ggml/src | Tensor library, graph execution, backend registry, CPU/CUDA/HIP/Metal/Vulkan/SYCL/OpenCL/RPC and other backends. |
| gguf-py/gguf/* | Python GGUF reader/writer, constants, quant helpers, metadata utilities, tensor mapping. |
| conversion/*, convert_hf_to_gguf.py, convert_lora_to_gguf.py | Converts models/adapters from HF/PyTorch formats to GGUF. |
| docs/* | Build, install, Docker, multi-GPU, function calling, multimodal, speculative, backend, and model-development guides. |
| tests/* | C/C++ and Python tests for backends, GGUF, quantization, tokenizers, chat templates, grammar, thread safety, model load cancel, server. |
| CMakeLists.txt, CMakePresets.json, Makefile | Native build system and presets. |
| pyproject.toml | Python script metadata; scripts such as llama-convert-hf-to-gguf; dependencies for conversion tooling. |

Core concepts

GGUF. GGUF is the single-file format of llama.cpp and ggml. Conversion code reads the original config, tokenizer data, tensor names, and tensor values, then writes normalized metadata and tensors. gguf-py/gguf/constants.py and tensor_mapping.py are the central contract.
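
To make the "single file with metadata and tensors" idea concrete, the sketch below parses only the fixed-size prefix of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte magic, a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count). The function name `parse_gguf_header` and the synthetic header bytes are illustrative, not part of the repo; real tooling would normally go through gguf-py instead.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 + uint64 + uint64


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF prefix (assumed little-endian, version >= 2)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


# Synthetic header for illustration only; no real model file is read here.
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = parse_gguf_header(fake_header)
print(header)
```

The metadata key/value pairs and the tensor descriptors follow this prefix in the file; parsing them is left out to keep the sketch short.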

libllama API. include/llama.h is the external contract for embedding llama.cpp into an application. It exposes backend initialization, model/context creation, batches, encode/decode, samplers, embeddings, and metadata APIs.

ggml graph execution. llama.cpp builds a graph for the model architecture and hands tensor operations to ggml. Backends in ggml/src/ggml-cpu, ggml-cuda, ggml-metal, ggml-vulkan, ggml-sycl, and other directories execute the graph.

Context. llama_context holds the runtime state for inference: KV cache, backend scheduler, logits/embeddings buffers, batch handling, and decode state.

Batch and micro-batch. A llama_batch can hold one or more sequences. src/llama-batch.cpp splits the work into micro-batches (ubatch) to fit execution and memory constraints.
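
The splitting idea can be sketched as plain chunking. This is a deliberate simplification: it treats the batch as a flat token list and ignores sequence IDs, positions, and output flags, which the real splitting in src/llama-batch.cpp has to respect. The helper name `split_into_ubatches` and the parameter `n_ubatch` are hypothetical.

```python
from typing import Iterator, List


def split_into_ubatches(tokens: List[int], n_ubatch: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most n_ubatch tokens."""
    if n_ubatch <= 0:
        raise ValueError("n_ubatch must be positive")
    for start in range(0, len(tokens), n_ubatch):
        yield tokens[start:start + n_ubatch]


# A 10-token prompt processed with a micro-batch size of 4.
prompt_tokens = list(range(10))
ubatches = list(split_into_ubatches(prompt_tokens, n_ubatch=4))
print(ubatches)
```

A smaller micro-batch lowers peak compute-buffer memory at the cost of more graph executions per batch, which is the trade-off that the batch and micro-batch size flags expose.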

KV cache and memory. src/llama-kv-cache.cpp, llama-memory.cpp, llama-memory-hybrid.cpp, llama-memory-recurrent.cpp, and related files manage context memory. Server flags such as --ctx-size, --cache-type-k, --cache-type-v, --kv-offload, --kv-unified, --cache-prompt, and the slot controls expose this behavior.
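
A back-of-the-envelope estimate shows why context size and cache type matter for memory. The sketch assumes a standard attention model in which every layer caches K and V for every position; it ignores sliding-window, hybrid, and recurrent layouts as well as any padding. The model dimensions below are hypothetical, and the 34/32 bytes-per-element figure assumes an 8-bit block format that stores 32 values plus a 2-byte scale.

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_kv_head: int, head_dim: int,
                   bytes_per_elem: float = 2.0) -> int:
    """Estimate KV cache size: K and V per layer, per position."""
    # Factor 2 covers the separate K and V tensors.
    per_token_per_layer = 2 * n_kv_head * head_dim * bytes_per_elem
    return int(per_token_per_layer * n_layer * n_ctx)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 4096-token context.
f16_size = kv_cache_bytes(n_layer=32, n_ctx=4096, n_kv_head=8, head_dim=128)
q8_size = kv_cache_bytes(n_layer=32, n_ctx=4096, n_kv_head=8, head_dim=128,
                         bytes_per_elem=34 / 32)
print(f"f16: {f16_size / 2**20:.0f} MiB, 8-bit blocks: {q8_size / 2**20:.0f} MiB")
```

Doubling the context size doubles the estimate, so the context-size flag and the cache type flags are usually the first knobs to check when a model does not fit.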

Quantization. The project supports many quantization types, which reduce memory use and make local inference feasible. Runtime quantized ops live in the ggml backends, while tools such as tools/quantize and scripts such as convert_hf_to_gguf.py produce quantized assets.
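
The memory saving comes from storing small integers plus a per-block scale instead of full-precision floats. The sketch below shows a simple symmetric 8-bit block scheme in that spirit: 32 values per block, one scale per block. It is an illustration, not the ggml implementation; in particular it keeps the scale as a Python float rather than a 16-bit float, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32  # values per block in this sketch


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to (scale, int8-range values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0] * len(values)
    scale = amax / 127.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


block = [math.sin(i) for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f} max_err={max_err:.6f}")
```

The rounding error per value is bounded by half the scale, so blocks with one large outlier lose more precision on their small values. That sensitivity is one reason lower-bit formats and importance-matrix guidance exist.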

Backend selection. Build flags and runtime flags determine whether CPU, Metal, CUDA, HIP, Vulkan, SYCL, OpenCL, RPC, or other paths are available. docs/build.md and docs/multi-gpu.md describe the operational choices.

Server slots and continuous batching. tools/server supports parallel decoding, multi-user serving, continuous batching, slot monitoring, prompt caching, metrics, and model routing.
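
The interaction between slots and continuous batching can be illustrated with a toy scheduler. This is a conceptual sketch, not the server's actual queue or slot code: each request only carries a number of tokens to generate, every decode step produces one token for each active slot, and a finished slot is refilled from the queue before the next step. The function name `simulate_slots` is hypothetical.

```python
from collections import deque
from typing import Dict, List


def simulate_slots(requests: Dict[str, int], n_slots: int) -> List[List[str]]:
    """Return, per decode step, the request ids batched together.

    requests maps a request id to the number of tokens it generates
    (assumed to be at least 1).
    """
    queue = deque(requests.items())
    slots: Dict[int, list] = {}  # slot index -> [request id, tokens remaining]
    steps: List[List[str]] = []
    while queue or slots:
        # Fill free slots from the queue before the next decode step.
        for i in range(n_slots):
            if i not in slots and queue:
                request_id, n_tokens = queue.popleft()
                slots[i] = [request_id, n_tokens]
        # One decode step yields one token for every active slot.
        steps.append([slots[i][0] for i in sorted(slots)])
        for i in list(slots):
            slots[i][1] -= 1
            if slots[i][1] <= 0:
                del slots[i]  # released; can be refilled on the next step
    return steps


steps = simulate_slots({"a": 3, "b": 1, "c": 2}, n_slots=2)
for n, batch in enumerate(steps, start=1):
    print(n, batch)
```

In the run above, request "c" joins the batch as soon as "b" finishes instead of waiting for "a" to complete, which is the core benefit of continuous batching over static batching.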

System component diagram

flowchart LR
    User[CLI, HTTP client, native app] --> Tools[tools/cli, tools/server, examples]
    Tools --> Common[common\nargs, chat, sampling, HF cache, grammar]
    Tools --> API[include/llama.h\nlibllama C API]
    API --> Context[src/llama-context.cpp\nruntime context]
    Context --> Model[src/llama-model.cpp\narchitecture graph]
    Context --> KV[src/llama-kv-cache + llama-memory\ncache and state]
    Context --> Sampler[src/llama-sampler.cpp\nsampling chain]
    Model --> Loader[src/llama-model-loader.cpp\nGGUF tensors and metadata]
    Loader --> GGUF[gguf-py + ggml gguf\nsingle-file model format]
    Context --> GGML[ggml graph scheduler]
    GGML --> Backends[CPU, CUDA, HIP, Metal,\nVulkan, SYCL, OpenCL, RPC]
    Conversion[convert_hf_to_gguf.py\nconversion/*] --> GGUF
    Tests[tests + server tests] --> API

Internal architecture

llama.cpp consists of a native runtime core and the tools that surround it.

Public API layer. include/llama.h is the stable interface that consumers compile against. An application calls llama_backend_init, loads a model, creates a context, builds a batch, calls llama_decode or llama_encode, then reads logits, embeddings, or tokens. include/llama-cpp.h adds ownership helpers for C++.

Model and metadata layer. src/llama-model-loader.cpp reads GGUF files, tensor metadata, architecture-specific keys, and weight data. src/llama-arch.h and src/llama-arch.cpp define architecture constants, tensor names, and metadata mappings that must stay in sync with gguf-py/gguf/constants.py.
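
The need to keep both sides in sync is easier to see with a small lookup sketch. It assumes the common GGUF convention that a `general.architecture` value names the architecture and that hyperparameters live under keys prefixed with that name. The exact key names used here should be treated as assumptions to check against gguf-py/gguf/constants.py, and the helper `read_hparams` is hypothetical.

```python
from typing import Any, Dict

# Hyperparameter suffixes this sketch expects; assumed, not exhaustive.
REQUIRED_SUFFIXES = ("context_length", "embedding_length", "block_count")


def read_hparams(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve architecture-prefixed keys from a flat metadata dict."""
    arch = metadata["general.architecture"]
    missing = [s for s in REQUIRED_SUFFIXES if f"{arch}.{s}" not in metadata]
    if missing:
        raise KeyError(f"missing metadata for architecture {arch!r}: {missing}")
    hparams = {s: metadata[f"{arch}.{s}"] for s in REQUIRED_SUFFIXES}
    hparams["arch"] = arch
    return hparams


# Hand-written metadata for illustration; not read from a real file.
metadata = {
    "general.architecture": "llama",
    "llama.context_length": 8192,
    "llama.embedding_length": 4096,
    "llama.block_count": 32,
}
hparams = read_hparams(metadata)
print(hparams)
```

If the converter writes a key under one name and the loader expects another, the lookup fails at load time, which is why a new architecture has to touch both the Python constants and the C++ architecture tables.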

Graph construction layer. src/llama-model.cpp builds ggml graphs for the supported architectures. The guide docs/development/HOWTO-add-model.md states that a new model needs conversion support, architecture metadata, a graph implementation, and optionally multimodal encoder support.

Runtime context layer. src/llama-context.cpp holds execution state, manages decode/encode calls, extracts logits/embeddings, handles output, and invokes the backend scheduler.

Memory/KV layer. KV cache and memory are split across several files to handle recurrent, hybrid, sliding-window, and standard attention models.

Shared tool layer. common/arg.cpp, common/common.cpp, common/sampling.cpp, common/chat.cpp, common/hf-cache.cpp, and common/download.cpp are reused by the CLI, server, and tools. As a result, individual binaries do not have to reimplement argument parsing, sampling, chat templates, HF downloads, or logging.

Backend layer. ggml manages low-level tensor execution and backend registration. ggml/src/ggml-backend-reg.cpp, ggml-backend.cpp, and the backend directories provide the ops for each implementation.

End-to-end runtime flow

sequenceDiagram
    participant C as User/client
    participant T as llama-cli or llama-server
    participant A as common args/chat/sampling
    participant L as libllama API
    participant M as model loader
    participant X as llama_context
    participant G as ggml backend scheduler
    participant S as sampler
    C->>T: prompt, chat request, or API call
    T->>A: parse flags, chat template, sampling config
    A->>L: llama_backend_init + load model/context
    L->>M: read GGUF metadata and tensors
    M-->>L: llama_model
    L-->>X: context with KV cache and backend scheduler
    loop generation
        T->>X: build llama_batch and call llama_decode
        X->>G: execute model graph on the selected backend
        G-->>X: logits / embeddings / updated KV
        X-->>S: logits
        S-->>T: next token
        T-->>C: token text, stream chunk, or JSON delta
    end
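
The generation loop in the diagram can be sketched with stand-ins for the native calls. Nothing below touches libllama: `ToyModel.decode` is a hypothetical placeholder for building a batch and calling llama_decode, `greedy` stands in for the sampler, and `EOS_TOKEN` is an invented id for a four-token toy vocabulary. The point is the shape of the loop: the first batch carries the whole prompt, and later batches carry only the newly sampled token because earlier state stays in the cache.

```python
from typing import Callable, List, Sequence

EOS_TOKEN = 0  # hypothetical end-of-sequence id for the toy vocabulary


class ToyModel:
    """Stand-in for a context; n_past plays the role of the KV cache length."""

    def __init__(self, n_vocab: int = 4) -> None:
        self.n_vocab = n_vocab
        self.n_past = 0

    def decode(self, batch: Sequence[int]) -> List[float]:
        # "Process" the batch and return logits that favor last token + 1.
        self.n_past += len(batch)
        next_token = (batch[-1] + 1) % self.n_vocab
        return [1.0 if t == next_token else 0.0 for t in range(self.n_vocab)]


def greedy(logits: Sequence[float]) -> int:
    """Pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda t: logits[t])


def generate(prompt: Sequence[int],
             decode: Callable[[Sequence[int]], List[float]],
             sample: Callable[[Sequence[float]], int],
             max_tokens: int) -> List[int]:
    batch = list(prompt)          # first batch: the full prompt
    output: List[int] = []
    for _ in range(max_tokens):
        logits = decode(batch)
        token = sample(logits)
        if token == EOS_TOKEN:
            break
        output.append(token)
        batch = [token]           # later batches: only the new token
    return output


model = ToyModel()
generated = generate([1], model.decode, greedy, max_tokens=16)
print(generated, model.n_past)
```

The two stop conditions here, an end-of-sequence token and a token budget, correspond to a subset of the exits a real CLI or server loop handles.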

Runtime and data flow

  1. Obtain the model. The user passes -m model.gguf, -hf user/repo, --model-url, or Docker model args. common/hf-cache.cpp, common/download.cpp, and common_get_model_endpoint() support caching and endpoint resolution.
  2. Load the model. llama_model_loader reads GGUF metadata and tensors, usually via memory mapping unless it is disabled with --no-mmap.
  3. Initialize the context. llama_context is created with context params, backend devices, KV cache settings, Flash Attention, thread settings, and optional offload.
  4. Process the prompt. The CLI/server tokenizes input, applies chat templates from common/chat.cpp, handles grammars/JSON schema, and creates a llama_batch.
  5. Graph execution. The per-architecture graph is built and run through ggml. The backend scheduler places tensors and ops on CPU/GPU backends according to build/runtime availability.
  6. Update the KV cache. Attention state is stored in the cache and can be offloaded, quantized, unified, shifted, saved, or restored depending on flags and server slot state.
  7. Sampling. src/llama-sampler.cpp and common/sampling.cpp apply sampler chains such as penalties, top-k, top-p, min-p, temperature, Mirostat, grammar constraints, and logit bias.
  8. Output. The CLI prints tokens; the server streams or returns JSON through tools/server/server-http.cpp, server-task.cpp, server-context.cpp, and route-specific code.
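
The sampling step (step 7) is a chain of filters applied to the logits before one token is drawn. The sketch below chains three of the named samplers (top-k, top-p, temperature) over a hand-written logits table. It is a conceptual illustration with hypothetical function names and an arbitrary order; it does not reproduce the project's default chain or its other samplers.

```python
import math
import random
from typing import Dict


def softmax(logits: Dict[int, float]) -> Dict[int, float]:
    peak = max(logits.values())
    exps = {t: math.exp(v - peak) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def top_k(logits: Dict[int, float], k: int) -> Dict[int, float]:
    """Keep the k highest-scoring tokens."""
    keep = sorted(logits, key=logits.get, reverse=True)[:k]
    return {t: logits[t] for t in keep}


def top_p(logits: Dict[int, float], p: float) -> Dict[int, float]:
    """Keep the smallest high-probability set whose mass reaches p."""
    probs = softmax(logits)
    kept: Dict[int, float] = {}
    cumulative = 0.0
    for t in sorted(probs, key=probs.get, reverse=True):
        kept[t] = logits[t]
        cumulative += probs[t]
        if cumulative >= p:
            break
    return kept


def temperature(logits: Dict[int, float], temp: float) -> Dict[int, float]:
    """Scale logits; temp must be positive. Lower values sharpen the distribution."""
    return {t: v / temp for t, v in logits.items()}


def sample(logits: Dict[int, float], rng: random.Random) -> int:
    probs = softmax(logits)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Token id -> logit; values are made up for the example.
logits = {10: 2.0, 11: 1.0, 12: 0.5, 13: -1.0, 14: -3.0}
candidates = temperature(top_p(top_k(logits, k=3), p=0.9), temp=0.8)
token = sample(candidates, random.Random(0))
print(sorted(candidates), token)
```

Each stage only narrows or rescales the candidate set, so stages can be reordered or dropped independently, which is what makes a chain a convenient structure for sampler configuration.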

Deployment and operations topology

flowchart TB
    subgraph Local
        CLI[llama-cli / llama-completion]
        Lib[Native application embedding libllama]
    end
    subgraph ServerNode["llama-server node"]
        HTTP[tools/server\nhttplib HTTP server]
        Queue[server queue and slots]
        Contexts[llama_context instances]
        UI[Web UI / static assets]
    end
    subgraph Models
        HF[Hugging Face repo]
        GGUF[Local GGUF files]
        LoRA[LoRA GGUF adapters]
        MMProj[Multimodal projector]
    end
    subgraph Runtime
        GGML[ggml backend registry]
        CPU[CPU SIMD / BLAS]
        GPU[CUDA/HIP/Metal/Vulkan/SYCL/OpenCL]
        RPC[RPC backend]
    end
    subgraph Ops
        Docker[GHCR images\nfull/light/server variants]
        Metrics[Prometheus metrics endpoint]
        Bench[llama-bench / perplexity]
    end
    CLI --> GGUF
    Lib --> GGUF
    HTTP --> Queue --> Contexts --> GGML
    HTTP --> UI
    HF --> GGUF
    LoRA --> Contexts
    MMProj --> Contexts
    GGML --> CPU
    GGML --> GPU
    GGML --> RPC
    Docker --> HTTP
    Metrics --> HTTP
    Bench --> GGML

Deployment patterns:

Lifecycle, decisions, and module dependencies

stateDiagram-v2
    [*] --> Build
    Build --> SelectBackend: CMake flags determine backends
    SelectBackend --> AcquireModel: local GGUF, HF cache, URL, Docker volume
    AcquireModel --> LoadGGUF: metadata and tensors
    LoadGGUF --> InitContext: context params, KV cache, devices
    InitContext --> PromptReady: tokenize and apply chat template
    PromptReady --> DecodeLoop
    DecodeLoop --> DecodeLoop: batch -> graph -> logits -> sample
    DecodeLoop --> Finished: EOS, max tokens, reverse prompt, stop
    DecodeLoop --> Error: OOM, invalid GGUF, backend unsupported
    Finished --> [*]
    Error --> [*]

flowchart LR
    Convert[convert_hf_to_gguf.py\nconversion/*] --> GGUFConstants[gguf-py/gguf/constants.py]
    GGUFConstants --> Arch[src/llama-arch.h/.cpp]
    Arch --> Loader[src/llama-model-loader.cpp]
    Loader --> Model[src/llama-model.cpp]
    Model --> Context[src/llama-context.cpp]
    Context --> Batch[src/llama-batch.cpp]
    Context --> Memory[src/llama-kv-cache*.cpp\nllama-memory*.cpp]
    Context --> GGML[ggml/src/ggml-backend*.cpp]
    GGML --> BackendDirs[ggml/src/ggml-cpu\ncuda, metal, vulkan, sycl]

Extension points

Integration

Configuration, deployment, and ops

llama.cpp configuration splits into build-time and runtime settings.

Build-time: CMake options select which backends are available. docs/build.md covers CPU, BLAS, Metal, SYCL, CUDA, MUSA, HIP, Vulkan, CANN, ZenDNN, KleidiAI, OpenCL, Android, and OpenVINO. Build flags determine whether a binary can use a given backend at all.

Runtime: common/arg.cpp centralizes the flags for threads, CPU affinity, context size, batch/micro-batch size, Flash Attention, RoPE scaling, KV cache types, mmap/mlock/direct I/O, devices, GPU layers, split mode, LoRA, model source, logging, sampling, grammar, server host/port, API key, TLS, metrics, slots, props, prompt cache, and model routing.
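
Because runtime behavior is driven by flags, a deployment script often ends up assembling a command line from a reviewed configuration. The sketch below builds an argument list for llama-server without launching anything. The helper `build_server_argv` and the model path are hypothetical; the flag spellings `--host`, `--port`, and `--parallel` are assumptions that should be checked against common/arg.cpp for the build in use.

```python
import shlex
from typing import List


def build_server_argv(model_path: str, ctx_size: int, n_gpu_layers: int,
                      parallel: int, host: str = "127.0.0.1",
                      port: int = 8080, metrics: bool = False) -> List[str]:
    """Assemble a llama-server command line; nothing is executed."""
    argv = [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--parallel", str(parallel),   # assumed flag for the number of slots
        "--host", host,                # binds to loopback by default here
        "--port", str(port),
    ]
    if metrics:
        argv.append("--metrics")
    return argv


argv = build_server_argv("models/model.gguf", ctx_size=8192,
                         n_gpu_layers=99, parallel=4, metrics=True)
print(shlex.join(argv))
```

Keeping the list form (rather than a shell string) avoids quoting problems when the result is passed to a process launcher, and defaulting the host to loopback keeps the server unexposed until a reverse proxy is placed in front of it.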

Operational practices:

Observability, testing, evaluation, and failure modes

Observability and benchmarking:

Testing anchors:

Common failure modes:

Security and governance risks

Source reading guide

  1. Start with README.md to understand the goals, quick start, and model/hardware claims.
  2. Read docs/build.md, docs/docker.md, and docs/multi-gpu.md to understand deployment constraints.
  3. Read include/llama.h before reading the implementation.
  4. Trace model loading through src/llama-model-loader.cpp and the constants in src/llama-arch.*.
  5. Study src/llama-context.cpp, src/llama-batch.cpp, and src/llama-kv-cache.cpp for runtime behavior.
  6. Read src/llama-model.cpp for graph construction.
  7. Read ggml/include/ggml.h, ggml/include/ggml-backend.h, and the backend directories for execution.
  8. Read tools/server/README.md and tools/server/*.cpp for service operation.
  9. Read docs/development/HOWTO-add-model.md before adding model support.
  10. Use tests/* to understand expected behavior before changing internals.

Learning path

  1. Build a CPU-only mental model with examples/simple/simple.cpp.
  2. Follow tools/cli/main.cpp into common argument parsing and libllama.
  3. See how a llama_batch is built and decoded.
  4. Read the sampling chain in src/llama-sampler.cpp and common/sampling.cpp.
  5. Take in the server architecture starting from tools/server/server.cpp, then the queue, context, task, HTTP, and model files.
  6. Learn GGUF conversion and constants before touching architecture support.
  7. Compare backend implementations once you understand the ggml backend API.
  8. Use benchmarks and tests to validate every change to performance or model support.

Production checklist and native runtime gates

llama.cpp readiness is split into build-time, model-time, and runtime gates. Key source anchors include include/llama.h, src/llama-model-loader.cpp, src/llama-context.cpp, src/llama-kv-cache*.cpp, src/llama-sampler.cpp, common/arg.cpp, tools/server/*, ggml/src/*, convert_hf_to_gguf.py, gguf-py/gguf/*, and docs/build.md.

| Gate | What to verify |
| --- | --- |
| Build | The binary is compiled with the intended backend: CPU BLAS, CUDA, HIP, Metal, Vulkan, SYCL, OpenCL, RPC, or a vendor-specific option. |
| GGUF provenance | Record the source model, conversion script version, tokenizer metadata, quantization type, and license. |
| Memory fit | --ctx-size, --batch-size, --ubatch-size, KV cache type, --n-gpu-layers, slots, and split mode must fit within RAM/VRAM. |
| Server exposure | llama-server must sit behind auth, TLS/reverse proxy, and rate limiting, with only the required routes/features enabled. |
| Tool surface | Experimental server tools, local media paths, file access, shell-like actions, and the static UI must be disabled unless clear governance is in place. |
| Observability | --metrics, logs, llama-bench, perplexity checks, and model-load diagnostics should be collected before canary traffic. |

flowchart LR
    Build[CMake build flags] --> Backend[ggml backend registry]
    Backend --> Binary[llama-cli, llama-server, libllama]
    Model[HF or local model] --> Convert[convert_hf_to_gguf.py and gguf-py]
    Convert --> GGUF[GGUF metadata and tensors]
    GGUF --> Loader[src/llama-model-loader.cpp]
    Binary --> Loader
    Loader --> Context[src/llama-context.cpp]
    Context --> KV[src/llama-kv-cache and memory files]
    Context --> Sampler[src/llama-sampler.cpp]
    Context --> Server[tools/server queue, slots, routes]
    Server --> Metrics[Prometheus metrics and logs]

Fault isolation map

A native runtime fails differently from a Python serving stack. A symptom such as "slow generation" can come from build flags, model format, backend placement, KV cache settings, sampler configuration, queue pressure on the server, or client misuse.

flowchart TD
    Symptom[llama.cpp symptom] --> Domain{Domain}
    Domain --> Build[Backend not compiled or wrong binary]
    Domain --> GGUF[Wrong GGUF or tokenizer metadata]
    Domain --> Memory[RAM, VRAM, KV, context, slots]
    Domain --> Backend[Backend op or split-mode issue]
    Domain --> Sampler[Chat template, grammar, sampling]
    Domain --> Server[Queue, route, auth, streaming]
    Domain --> Security[Tools, media path, API exposure]
    Build --> Files1[docs/build.md and CMake files]
    GGUF --> Files2[gguf-py, conversion, llama-model-loader]
    Memory --> Files3[llama-context, llama-kv-cache, common/arg.cpp]
    Backend --> Files4[ggml/src backends and docs/multi-gpu.md]
    Sampler --> Files5[common/chat.cpp, common/sampling.cpp, llama-sampler.cpp]
    Server --> Files6[tools/server server-http, queue, context]
    Security --> Files7[tools/server README and runtime flags]
    Files1 --> Remediate[Rebuild, reconvert, retune, or isolate]
    Files2 --> Remediate
    Files3 --> Remediate
    Files4 --> Remediate
    Files5 --> Remediate
    Files6 --> Remediate
    Files7 --> Remediate
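
The same map can be kept as a small lookup table in a runbook script, so that an on-call engineer who has classified a symptom gets pointed at the right files. The sketch below simply encodes the domains and file groups from the diagram; the names `TRIAGE_MAP` and `files_for` are hypothetical, and classifying the symptom into a domain remains a human step.

```python
from typing import Dict, List

# Domain -> where to look first, taken from the fault isolation diagram.
TRIAGE_MAP: Dict[str, List[str]] = {
    "build": ["docs/build.md", "CMake files"],
    "gguf": ["gguf-py", "conversion", "llama-model-loader"],
    "memory": ["llama-context", "llama-kv-cache", "common/arg.cpp"],
    "backend": ["ggml/src backends", "docs/multi-gpu.md"],
    "sampler": ["common/chat.cpp", "common/sampling.cpp", "llama-sampler.cpp"],
    "server": ["tools/server server-http", "tools/server queue", "tools/server context"],
    "security": ["tools/server README", "runtime flags"],
}


def files_for(domain: str) -> List[str]:
    """Return the starting points for a fault domain (case-insensitive)."""
    key = domain.strip().lower()
    if key not in TRIAGE_MAP:
        raise KeyError(f"unknown domain {domain!r}; expected one of {sorted(TRIAGE_MAP)}")
    return TRIAGE_MAP[key]


for name in ("Memory", "sampler"):
    print(name, "->", ", ".join(files_for(name)))
```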

Glossary

| Term | Meaning |
| --- | --- |
| GGUF | Single-file format holding metadata and tensors for ggml/llama.cpp inference. |
| ggml | Tensor and graph execution library used by llama.cpp. |
| libllama | Native library interface exposed through include/llama.h. |
| llama_context | Runtime state for inference, including KV cache, backend scheduler, logits, and embeddings. |
| llama_batch | Input structure holding tokens/embeddings, positions, sequence IDs, and output flags. |
| KV cache | Attention key/value memory from previous tokens. |
| ubatch | Micro-batch created from a larger llama_batch. |
| mmap | Memory mapping of model files to reduce load overhead and memory copies. |
| mlock | Request to keep model pages resident in RAM. |
| n-gpu-layers | Runtime option controlling how many layers are offloaded to the GPU. |
| split mode | Multi-GPU strategy such as layer or tensor. |
| LoRA adapter | Low-rank adapter that changes model behavior without replacing the base model. |
| imatrix | Importance matrix used to guide quantization. |
| mtmd | The multimodal library/tool area of llama.cpp. |
| slot | Server-side lane for a concurrent context/sequence. |