MLflow Architecture Notes

Executive summary

MLflow is a broad AI engineering platform for agents, LLM applications, and traditional ML models. The README positions it as a platform for debugging, evaluation, monitoring, prompt management, prompt optimization, AI Gateway governance, experiment tracking, model registry, and deployment. The repository is correspondingly large: mlflow/ contains the Python package, docs/ contains documentation, examples/ demonstrates workflows, tests/ mirrors the package structure, charts/ provides Kubernetes deployment, and docker-compose/ provides a local Postgres plus S3-compatible artifact store setup.

The root pyproject.toml identifies version 3.13.1.dev0, Python >=3.10, and dependencies including Flask, FastAPI, SQLAlchemy, Alembic, OpenTelemetry, Huey, Databricks SDK, Docker, Graphene, Pydantic, Uvicorn, and many integration libraries. MLflow is not a single-purpose observability tool. It is a platform with a tracking server, artifact stores, model registry, prompt registry, tracing APIs, GenAI evaluation, scorers, judges, gateway routing, deployment providers, and a React UI served by the Python server.

Problem solved

MLflow solves lifecycle fragmentation. AI teams need to track runs and metrics, store artifacts, manage models, inspect LLM traces, compare prompts, evaluate agent quality, govern model-provider access, and deploy assets. Without a platform, these concerns scatter across notebooks, object stores, model APIs, APM tools, spreadsheets, and custom dashboards. MLflow provides a common tracking API, server, storage model, UI, and extension system that spans classic ML and modern GenAI workflows.

AI stack role

In an AI solution architecture, MLflow can play several roles:

Source tree map

Repository evidence:

Core concepts

Internal architecture

graph TB
    User[SDK and CLI users] --> Fluent[mlflow.tracking.fluent]
    User --> Client[MlflowClient]
    Fluent --> Client
    Client --> StoreRegistry[Tracking service registry]
    StoreRegistry --> LocalStore[FileStore or SQLAlchemyStore]
    StoreRegistry --> RestStore[RestStore]
    RestStore --> Server[MLflow server]
    Server --> FastAPI[FastAPI wrapper]
    FastAPI --> Flask[Flask compatibility app]
    FastAPI --> OTel[OTEL API router]
    FastAPI --> GatewayRouter[Gateway router]
    Flask --> Handlers[server handlers]
    Handlers --> Backend[(Backend store)]
    Handlers --> Artifacts[(Artifact repositories)]
    GatewayRouter --> Providers[Gateway providers]
    GenAI[mlflow.genai scorers and judges] --> Client
    UI[React UI static assets] --> Server

MLflow deliberately preserves backward compatibility. mlflow/server/__init__.py still owns the Flask app and registers the long-standing REST and AJAX handlers. mlflow/server/fastapi_app.py wraps that Flask app with FastAPI so newer routers can take precedence for OTEL, jobs, gateway, and assistant functionality. This layered approach allows modern GenAI endpoints to evolve without breaking existing tracking clients.
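The precedence rule described above can be sketched in a few lines of plain Python. This is only an illustration of "newer routers first, legacy app as fallback"; the prefixes `/otel/` and `/gateway/` and every function name here are hypothetical, and the real wiring lives in mlflow/server/fastapi_app.py.

```python
from typing import Callable

Handler = Callable[[str], str]


def legacy_app(path: str) -> str:
    # Stands in for the wrapped Flask app that serves the long-standing routes.
    return f"legacy:{path}"


def make_router(name: str) -> Handler:
    # Builds a trivial handler that tags the response with the router name.
    return lambda path: f"{name}:{path}"


# Hypothetical prefixes chosen for illustration only.
NEWER_ROUTERS: list[tuple[str, Handler]] = [
    ("/otel/", make_router("otel")),
    ("/gateway/", make_router("gateway")),
]


def dispatch(path: str) -> str:
    # Newer routers are consulted first; unmatched paths fall through
    # to the legacy app, so existing clients keep working.
    for prefix, handler in NEWER_ROUTERS:
        if path.startswith(prefix):
            return handler(path)
    return legacy_app(path)
```

A request for `/otel/v1/traces` would be answered by the newer router, while a tracking path such as `/api/2.0/mlflow/runs/search` would still reach the legacy handlers.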

The store contract is central. AbstractStore defines not only experiment and run operations but also newer trace APIs such as start_trace, get_trace, search_traces, trace deletion, archival, trace metrics, assessment logging, session queries, prompt-to-trace links, and run-to-trace links. Concrete stores decide whether they support each operation and how persistence is implemented.
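One way to picture a contract in which concrete stores opt in to newer operations is an abstract base class with required methods and refusing defaults. The sketch below is a heavily reduced, hypothetical stand-in: `SketchStore`, `RunsOnlyStore`, `TracingStore`, and `supports_traces` are invented names, and the method signatures are not those of the real AbstractStore.

```python
from abc import ABC, abstractmethod


class SketchStore(ABC):
    """Hypothetical, reduced stand-in for a tracking store contract."""

    @abstractmethod
    def create_run(self, experiment_id: str) -> str:
        """Required operation: every concrete store must implement it."""

    def start_trace(self, trace_id: str) -> str:
        # Optional operation: the default refuses until a subclass overrides it.
        raise NotImplementedError(f"{type(self).__name__} does not support traces")


class RunsOnlyStore(SketchStore):
    def __init__(self) -> None:
        self.runs: list[str] = []

    def create_run(self, experiment_id: str) -> str:
        run_id = f"{experiment_id}-run-{len(self.runs)}"
        self.runs.append(run_id)
        return run_id


class TracingStore(RunsOnlyStore):
    def __init__(self) -> None:
        super().__init__()
        self.traces: dict[str, str] = {}

    def start_trace(self, trace_id: str) -> str:
        self.traces[trace_id] = "IN_PROGRESS"
        return trace_id


def supports_traces(store: SketchStore) -> bool:
    # Supported means the subclass replaced the refusing default.
    return type(store).start_trace is not SketchStore.start_trace
```

Callers can then probe a capability before using it, or simply handle `NotImplementedError` when a store does not offer the operation.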

Runtime and data flow

sequenceDiagram
    participant App as AI or ML application
    participant SDK as MLflow SDK
    participant Server as MLflow FastAPI plus Flask server
    participant Store as Tracking backend store
    participant Artifact as Artifact store
    participant UI as MLflow UI
    participant Eval as GenAI evaluators and scorers
    App->>SDK: log params, metrics, artifacts, traces, prompts
    SDK->>Server: REST or local store calls
    Server->>Store: persist experiments, runs, trace metadata
    Server->>Artifact: upload models, files, trace artifacts
    Eval->>SDK: log assessments and scorer results
    UI->>Server: query runs, traces, prompts, models
    Server->>Store: search and aggregate metadata
    Server->>Artifact: fetch files and trace payloads

The same client abstractions can target local file stores, database stores, Databricks-backed stores, or a remote MLflow server. In a local script, store calls may not cross a network boundary. In a shared deployment, the SDK talks to the tracking server, and the server writes metadata to a backend database while storing large artifacts in object storage.
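A minimal sketch of how one client call can land on different stores is a lookup keyed by the tracking URI scheme. The mapping and the `resolve_store` function below are assumptions made for illustration; they do not reproduce MLflow's actual registry, and special forms such as a bare `databricks` URI are not handled.

```python
from urllib.parse import urlparse

# Hypothetical mapping for illustration; not MLflow's real registry.
SCHEME_TO_STORE = {
    "": "file store",
    "file": "file store",
    "sqlite": "SQL store",
    "postgresql": "SQL store",
    "mysql": "SQL store",
    "http": "REST store",
    "https": "REST store",
    "databricks": "Databricks-backed store",
}


def resolve_store(uri: str) -> str:
    """Pick a store kind from the scheme of a tracking URI."""
    scheme = urlparse(uri).scheme
    # "postgresql+psycopg2://..." names a dialect plus a driver;
    # only the dialect decides which store kind is used.
    dialect = scheme.split("+", 1)[0]
    try:
        return SCHEME_TO_STORE[dialect]
    except KeyError:
        raise ValueError(f"unsupported tracking URI scheme: {scheme!r}") from None
```

Under this sketch, `./mlruns` and `sqlite:///mlflow.db` resolve to stores that never leave the local process, while an `https://` URI resolves to the REST store that talks to a tracking server.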

For GenAI tracing, model-provider integrations and autologging can emit spans and traces. The server exposes trace artifacts for UI rendering, and mlflow.genai can attach assessments through scorers and judges. Prompt versions can be linked to traces so quality regressions have lineage.

Deployment and operations topology

graph LR
    subgraph Clients
        Notebook[Notebooks]
        Services[Production services]
        CI[CI evaluation jobs]
    end
    subgraph MLflowRuntime
        Server[MLflow server]
        UI[Web UI]
        Gateway[AI Gateway]
    end
    subgraph Storage
        DB[(Postgres or other SQL backend)]
        Artifacts[(S3, RustFS, Azure, GCS, local PV)]
    end
    subgraph Ops
        Prom[Prometheus metrics]
        TLS[TLS and ingress]
        Cron[Cleanup cron job]
        Providers[LLM providers]
    end
    Notebook --> Server
    Services --> Server
    CI --> Server
    Server --> DB
    Server --> Artifacts
    UI --> Server
    Gateway --> Providers
    Server --> Prom
    TLS --> Server
    Cron --> DB
    Cron --> Artifacts

The local docker-compose/ topology runs Postgres, RustFS as S3-compatible artifact storage, an initialization container for the bucket, and the MLflow server. Key variables include MLFLOW_BACKEND_STORE_URI, MLFLOW_ARTIFACTS_DESTINATION, MLFLOW_S3_ENDPOINT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, MLFLOW_HOST, and MLFLOW_PORT.
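Before starting such a stack, a small preflight check can confirm that the variables listed above are set. The sketch below treats all seven as required, which is an assumption; the compose files themselves decide which ones have defaults. `missing_settings` is a hypothetical helper, not part of MLflow.

```python
import os
from collections.abc import Mapping

# Variable names are taken from the text above; treating every one
# of them as mandatory is an assumption of this sketch.
KEY_VARS = (
    "MLFLOW_BACKEND_STORE_URI",
    "MLFLOW_ARTIFACTS_DESTINATION",
    "MLFLOW_S3_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "MLFLOW_HOST",
    "MLFLOW_PORT",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the key variables that are unset or blank in env."""
    return [name for name in KEY_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Checks the current process environment and reports any gaps.
    for name in missing_settings(os.environ):
        print(f"missing: {name}")
```

Passing a plain dictionary instead of `os.environ` makes the same check usable in tests or against a parsed `.env` file.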

The Helm chart in charts/ supports a Kubernetes server with backend store URI, registry store URI, default artifact root, artifacts destination, env injection, Prometheus exposure, TLS, persistence for local storage, and a cleanup cron job for deleted runs/experiments/artifacts. The chart README explicitly warns that SQLite and local file storage are not suitable for production or high concurrency.

Lifecycle and module dependency diagram

stateDiagram-v2
    [*] --> Track
    Track --> Register
    Track --> Trace
    Trace --> Evaluate
    Evaluate --> Compare
    Compare --> OptimizePrompt
    OptimizePrompt --> Trace
    Register --> Deploy
    Deploy --> Monitor
    Monitor --> Trace
    Trace --> ArchiveOrDelete
    ArchiveOrDelete --> [*]

This lifecycle spans both traditional ML and GenAI. Classic tracking logs experiments and runs. Registry and deployment promote models or prompts. Tracing captures online behavior. Evaluation and scorers attach quality signals. Prompt optimization and discovery modules feed new versions back into the loop. Archive/delete paths protect storage and governance requirements.
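The state diagram above can also be written as a transition table, which makes it easy to check whether a proposed workflow follows the lifecycle. This is an illustrative encoding only: `Start` and `End` stand in for the diagram's `[*]` markers, and `is_valid_path` is a hypothetical helper.

```python
# Transition table copied from the state diagram; "Start" and "End"
# replace the diagram's [*] entry and exit markers.
TRANSITIONS: dict[str, set[str]] = {
    "Start": {"Track"},
    "Track": {"Register", "Trace"},
    "Trace": {"Evaluate", "ArchiveOrDelete"},
    "Evaluate": {"Compare"},
    "Compare": {"OptimizePrompt"},
    "OptimizePrompt": {"Trace"},
    "Register": {"Deploy"},
    "Deploy": {"Monitor"},
    "Monitor": {"Trace"},
    "ArchiveOrDelete": {"End"},
}


def is_valid_path(path: list[str]) -> bool:
    """Check that every consecutive pair of states is an allowed transition."""
    return all(
        nxt in TRANSITIONS.get(current, set())
        for current, nxt in zip(path, path[1:])
    )
```

For example, a prompt optimization loop (Trace, Evaluate, Compare, OptimizePrompt, Trace) is a valid path, whereas jumping from Track straight to Deploy is not, because the diagram routes deployment through Register.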

Extension points

Integrations

MLflow has one of the broadest integration surfaces in this group. Source directories include OpenAI, Anthropic, Bedrock, Gemini, Groq, LiteLLM, LlamaIndex, LangChain, LangGraph, CrewAI, AutoGen, DSPy, Pydantic AI, Semantic Kernel, Transformers, PyTorch, TensorFlow, sklearn, Spark, XGBoost, Azure, SageMaker, Kubernetes, Databricks, MCP, and many more. Gateway providers cover major LLM providers, while GenAI scorer integrations include Phoenix, Ragas, Deepeval, TruLens, Google ADK, guardrails, and online trace/session scoring.

Configuration, deployment, and operations

Important configuration groups:

Production deployments should use a relational backend such as Postgres or MySQL for metadata, object storage for artifacts, TLS at ingress, configured allowed hosts, explicit authentication, and secret-backed credentials. Local SQLite and local artifacts are useful for experiments but are not a production design.
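Part of this guidance can be turned into an automated review step. The sketch below covers only four of the points (relational backend, object storage, TLS, authentication); the scheme sets and the `production_findings` helper are assumptions for illustration and would need to be adapted to the stores a team actually uses.

```python
from urllib.parse import urlparse

# Assumed scheme sets for this sketch; extend them to match your platform.
RELATIONAL_SCHEMES = {"postgresql", "mysql"}
OBJECT_STORE_SCHEMES = {"s3", "gs", "wasbs", "abfss"}


def production_findings(
    backend_uri: str,
    artifact_uri: str,
    tls_enabled: bool,
    auth_enabled: bool,
) -> list[str]:
    """Return human-readable findings; an empty list means no issue was detected."""
    findings = []
    # Strip an optional "+driver" suffix so "postgresql+psycopg2" counts as postgresql.
    backend_dialect = urlparse(backend_uri).scheme.split("+", 1)[0]
    if backend_dialect not in RELATIONAL_SCHEMES:
        findings.append("backend store is not a relational server database")
    if urlparse(artifact_uri).scheme not in OBJECT_STORE_SCHEMES:
        findings.append("artifacts are not in object storage")
    if not tls_enabled:
        findings.append("TLS is not enabled at ingress")
    if not auth_enabled:
        findings.append("authentication is not configured")
    return findings
```

A local setup with `sqlite:///mlflow.db` and `./mlartifacts` would produce findings for every check, which matches the point that such a setup suits experiments rather than production.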

Observability, testing, evaluation, and failure modes

The tests/ directory mirrors most package areas: tracking, stores, server, gateway, GenAI, tracing, model flavors, integrations, artifacts, deployment, and CLI. The source itself includes observability hooks: Prometheus exporter activation in mlflow/server/__init__.py, OTEL API routing in fastapi_app.py, gateway timing middleware, trace metrics store APIs, and online scoring processors in mlflow/genai/scorers/online/.

Failure modes to plan for:

Security and governance risks

MLflow can store model artifacts, datasets, prompt text, traces, tool inputs, generated outputs, provider credentials, and experiment metadata. Governance controls should include authentication and authorization, isolated artifact buckets, secret-backed store URIs, TLS, host allowlists, auditability around model/prompt promotion, retention policies, and clear separation between local dev and shared production tracking.

For GenAI, the biggest data risk is trace content. Inputs, retrieved context, tool arguments, and model outputs may include sensitive customer or enterprise data. Teams should define logging filters, retention windows, and access rules before enabling automatic tracing in production.
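A logging filter of the kind mentioned above could look like the following sketch, which walks a trace payload and masks obviously sensitive values before they are stored. The key list, the email pattern, and the `redact` function are hypothetical and far from exhaustive; this is not an MLflow API, and how such a filter is hooked into tracing depends on the deployment.

```python
import re
from typing import Any

# Deliberately simple pattern and key list; real policies need more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SENSITIVE_KEYS = {"api_key", "authorization", "password"}


def redact(value: Any) -> Any:
    """Return a copy of a nested payload with sensitive content masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("[EMAIL]", value)
    # Numbers, booleans, and None pass through unchanged.
    return value
```

Applying the filter before a payload leaves the application keeps raw values out of both the backend store and the artifact store, so retention and access rules only have to cover the masked form.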

Reading guide

  1. Read README.md for product scope and quickstart.
  2. Read pyproject.toml for dependencies, extras, and entry points.
  3. Read mlflow/server/__init__.py and mlflow/server/fastapi_app.py for server architecture.
  4. Read mlflow/store/tracking/abstract_store.py before reading concrete stores.
  5. Read mlflow/tracking/client.py and mlflow/tracking/fluent.py to understand user-facing APIs.
  6. Read mlflow/tracing/ and trace entities for LLM observability.
  7. Read mlflow/genai/ for evaluation, scorers, judges, prompts, optimization, and online scoring.
  8. Read mlflow/gateway/app.py and mlflow/gateway/providers/ for model-provider governance.
  9. Read docker-compose/README.md and charts/README.md for deployment tradeoffs.

Learning path

  1. Start with a basic run: parameters, metrics, artifacts.
  2. Add a model or prompt registry workflow.
  3. Add LLM tracing and inspect how traces are stored and rendered.
  4. Add a scorer or judge and log assessments against traces.
  5. Add an AI Gateway route and study provider routing and rate limits.
  6. Move from local file/SQLite storage to a server with SQL backend and object storage.

Glossary

Repository-Grounded Deep Dive

MLflow is not a single service boundary; it is a set of tracking, registry, artifact, model packaging, gateway, and GenAI tracing subsystems that can be run locally or behind a tracking server. The repository shows these boundaries in mlflow/tracking/, mlflow/store/, mlflow/server/, mlflow/models/, mlflow/tracing/, mlflow/genai/, mlflow/gateway/, and mlflow/evaluation/. Deployment assets in docker/, docker-compose/, and charts/ should be read as operational examples, not as the architecture itself.

flowchart LR
    User["ML or GenAI application"] --> Fluent["fluent APIs and clients"]
    Fluent --> Tracking["tracking service mlflow/tracking"]
    Tracking --> BackendStore["backend store mlflow/store/tracking"]
    Tracking --> ArtifactStore["artifact repositories mlflow/store/artifact"]
    Fluent --> Models["model packaging mlflow/models and flavors"]
    Fluent --> Tracing["GenAI tracing mlflow/tracing"]
    Tracing --> Assess["assessments and scorers mlflow/genai"]
    Server["mlflow/server"] --> BackendStore
    Server --> ArtifactStore
    UI["server JS UI"] --> Server

The core operational distinction is between metadata and bulk artifacts. Run parameters, metrics, tags, model versions, prompt registry entries, trace indexes, and assessments belong to backend stores. Model files, datasets, logs, media, and large trace attachments belong to artifact stores. If a production design does not name both stores and their backup policies, it is incomplete.
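The split described above, and the completeness rule that follows from it, can be sketched as a small design review helper. The record kind names, the design dictionary layout, and both functions are hypothetical; they only mirror the categories listed in the paragraph.

```python
# Record kinds mirror the paragraph above; the identifiers are illustrative.
METADATA_KINDS = frozenset(
    {"param", "metric", "tag", "model_version", "prompt_entry", "trace_index", "assessment"}
)
ARTIFACT_KINDS = frozenset({"model_file", "dataset", "log", "media", "trace_attachment"})


def store_for(kind: str) -> str:
    """Say which store a record kind belongs to."""
    if kind in METADATA_KINDS:
        return "backend"
    if kind in ARTIFACT_KINDS:
        return "artifact"
    raise ValueError(f"unknown record kind: {kind}")


def design_gaps(design: dict) -> list[str]:
    """List what a production design is missing for each of the two stores."""
    gaps = []
    for store in ("backend", "artifact"):
        entry = design.get(store) or {}
        if not entry.get("uri"):
            gaps.append(f"{store} store is not named")
        if not entry.get("backup_policy"):
            gaps.append(f"{store} store has no backup policy")
    return gaps
```

A design that names only a Postgres backend with nightly snapshots would still report two gaps for the artifact store, which is exactly the incompleteness the paragraph warns about.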

sequenceDiagram
    participant App as Training or LLM app
    participant Client as MLflow client
    participant Server as Tracking server
    participant Meta as Backend store
    participant Art as Artifact store
    participant Eval as GenAI scorer or judge
    App->>Client: log run, model, prompt, or trace
    Client->>Server: REST tracking request
    Server->>Meta: write params, metrics, trace metadata
    Server->>Art: upload artifacts or trace attachments
    App->>Eval: evaluate output or trace
    Eval->>Client: log assessment
    Client->>Server: attach score to run or trace

flowchart TD
    Risk["Production risk"] --> Store["backend store migration"]
    Risk --> Artifact["artifact store permissions"]
    Risk --> Gateway["AI Gateway route"]
    Risk --> Trace["trace volume"]
    Risk --> Flavor["model flavor dependency drift"]
    Risk --> Judge["judge model cost and variance"]
    Store --> S1["runs and registry metadata unavailable"]
    Artifact --> A1["model loads but files cannot be fetched"]
    Gateway --> G1["provider key or rate limit failure"]
    Trace --> T1["large payloads require archival policy"]
    Flavor --> F1["logged model not reproducible"]
    Judge --> J1["evaluation score changes across model versions"]

Production Readiness Checklist

Senior Architect Reading Path

Start with mlflow/tracking/ and mlflow/server/ to understand API and server shape. Then read mlflow/store/ to separate backend metadata stores from artifact repositories. Move to mlflow/models/ and selected flavor packages for model packaging. Only then read mlflow/tracing/, mlflow/genai/, mlflow/evaluation/, and mlflow/gateway/ to understand modern LLMOps features on top of the tracking substrate. Finish with charts/, docker/, and tests/ to connect source behavior to deployment and compatibility checks.

Operational Scenarios to Rehearse

Validate MLflow with workflows that cross subsystem boundaries. Log a run with parameters, metrics, a model artifact, and a dataset, then restore it from a different runtime to prove backend and artifact stores both work. Log a GenAI trace with attachments and assessments, then test archival and search behavior under retention limits. Route a gateway request through at least two providers with budget controls enabled, then observe how provider errors, guardrails, and tracking records appear in the same operational view.