Phoenix Architecture Notes

Executive summary

Phoenix is an open source AI observability and evaluation platform from Arize. The repository describes Phoenix as a platform for tracing, evaluation, datasets, experiments, playground prompt iteration, and prompt management. The implementation is a mixed Python and TypeScript product: src/phoenix/ contains the server, tracing, database, datasets, sessions, metrics, and utilities; app/ and frontend assets provide the web UI; packages/ contains Python subpackages such as phoenix-client, phoenix-evals, and phoenix-otel; js/packages/ contains TypeScript packages such as phoenix-otel, phoenix-client, phoenix-evals, phoenix-mcp, and phoenix-cli.

The root pyproject.toml identifies the package as arize-phoenix, licensed under Elastic-2.0, requiring Python >=3.10,<3.15, and depending on FastAPI, Starlette, Uvicorn, Strawberry GraphQL, SQLAlchemy async, Alembic, OpenTelemetry, OpenInference semantic conventions, gRPC, Prometheus, Authlib, LDAP, and Phoenix client/eval/OTel packages. The main server can run locally, in notebooks, in containers, or in Kubernetes. The included Docker, Compose, Helm, and Kustomize files make the intended topology clear: Phoenix web/API on port 6006, OTLP gRPC ingestion on port 4317, and SQLite or PostgreSQL persistence.

Problem solved

Phoenix solves the AI debugging loop around traces, evaluations, and experiments. For LLM and agent systems, generic APM often captures latency but not prompt variables, retrieval documents, token/cost details, tool calls, evaluator labels, dataset examples, or experiment lineage. Phoenix receives OpenTelemetry/OpenInference spans, stores them in an AI-aware schema, exposes trace and span analysis in the UI, and connects that telemetry to datasets, experiments, playground runs, code evaluators, LLM-as-judge evaluators, and prompt versions.

AI stack role

Phoenix is an observability and evaluation workbench in the AI stack:

Source tree map

Repository evidence:

Core concepts

Internal architecture

graph TB
  Apps[AI apps and OpenInference instrumentations] --> OTLP[OTLP gRPC and HTTP ingestion]
  SDK[Python and TypeScript Phoenix clients] --> REST[FastAPI REST API]
  Browser[Web UI] --> GQL[Strawberry GraphQL]
  Browser --> REST
  OTLP --> Decode[decode_otlp_span and trace schemas]
  Decode --> Bulk[BulkInserter and Facilitator]
  REST --> DBSession[Async SQLAlchemy sessions]
  GQL --> Loaders[GraphQL DataLoaders]
  Loaders --> DBSession
  Bulk --> DB[(SQLite or PostgreSQL)]
  DBSession --> DB
  Daemons[Experiment runner, sweepers, cost calculator, disk monitor] --> DB
  Packages[phoenix-evals and phoenix-client] --> REST
  MCP[phoenix-mcp and phoenix-cli] --> REST

src/phoenix/server/app.py is the architectural hub. It imports FastAPI, GraphQL router support, gRPC interceptors, data loaders, BulkInserter, Facilitator, GrpcServer, DbDiskUsageMonitor, ExperimentRunner, ExperimentSweeper, GenerativeModelStore, SpanCostCalculator, TraceDataSweeper, authentication backends, redaction, encryption, sandbox session management, and OpenTelemetry server instrumentation. This is not a thin API wrapper; it is the composition root for the product.

The database layer is asynchronous and migration-aware. src/phoenix/db/alembic.ini and src/phoenix/db/migrations/ show that schema changes are managed through Alembic migrations. src/phoenix/db/aws_auth.py shows cloud-specific database authentication support. Phoenix can use local SQLite for light deployments or PostgreSQL for durable multi-user or self-hosted deployments.

The trace layer is deliberately OpenTelemetry-aligned. src/phoenix/trace/attributes.py documents flattening and unflattening OTEL attributes while preserving nested structures, and src/phoenix/trace/schemas.py defines span data structures inspired by OpenTelemetry.
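The flatten/unflatten idea can be illustrated with a minimal sketch. This is not the code in src/phoenix/trace/attributes.py; the function names `flatten` and `unflatten` are hypothetical, the attribute values are invented, and the sketch handles only nested dicts (lists and key collisions are left out for brevity).

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into a single level of dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # scalars (and empty dicts) are stored as leaves
    return flat


def unflatten(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the nested structure from dotted keys."""
    nested: Dict[str, Any] = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Illustrative attributes using OpenInference-style key names; values are made up.
span_attributes = {
    "llm": {"model_name": "example-model", "token_count": {"prompt": 12, "completion": 30}},
    "input": {"value": "hi"},
}
flat_attributes = flatten(span_attributes)
```

Flattening makes attributes easy to ship as OTLP key/value pairs, while unflattening restores the nested view that analysis code usually prefers.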

Runtime and data flow

sequenceDiagram
  participant App as Instrumented AI app
  participant OTel as OpenInference or Phoenix OTel SDK
  participant GRPC as Phoenix OTLP gRPC 4317
  participant Server as Phoenix server 6006
  participant DB as SQLite or PostgreSQL
  participant UI as Phoenix UI
  participant Eval as phoenix-evals or code evaluator
  App->>OTel: Create spans with AI semantic attributes
  OTel->>GRPC: Export OTLP spans
  GRPC->>Server: Decode and enqueue span insertion
  Server->>DB: Bulk insert traces, spans, costs, annotations
  UI->>Server: Query GraphQL and REST resources
  Server->>DB: Load traces, datasets, prompts, experiments
  Eval->>Server: Submit annotations or experiment results
  UI->>Server: Compare traces, datasets, prompt variants

The ingestion path is optimized for observability payloads rather than generic metrics. Phoenix receives OpenTelemetry spans, decodes them into Phoenix trace models, inserts them through the database facilitator/bulk inserter, then exposes them through GraphQL dataloaders and REST resources. Evaluation and experiment workflows can originate from the UI, client packages, or agent tooling; their results land back in the database as annotations, experiment runs, or evaluator outputs.
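The enqueue-then-bulk-insert pattern described above can be sketched with the standard library alone. This is an illustration of the general technique, not Phoenix's BulkInserter: the `Span` tuple, the `spans` table, `drain_in_batches`, and `run_demo` are all hypothetical, and an in-memory SQLite database stands in for the real persistence layer.

```python
import asyncio
import sqlite3
from typing import List, Tuple

# Toy stand-in for a decoded span: (trace_id, span_id, name).
Span = Tuple[str, str, str]


async def drain_in_batches(
    queue: "asyncio.Queue[Span]", conn: sqlite3.Connection, batch_size: int
) -> Tuple[int, int]:
    """Move queued spans into the database, one executemany call per batch."""
    inserted = batches = 0
    while not queue.empty():
        batch: List[Span] = []
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        conn.executemany("INSERT INTO spans VALUES (?, ?, ?)", batch)
        conn.commit()
        inserted += len(batch)
        batches += 1
        await asyncio.sleep(0)  # yield so other tasks (e.g. ingestion) can run
    return inserted, batches


async def run_demo() -> Tuple[int, int, int]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spans (trace_id TEXT, span_id TEXT, name TEXT)")
    queue: "asyncio.Queue[Span]" = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(("trace-1", f"span-{i}", "llm_call"))
    inserted, batches = await drain_in_batches(queue, conn, batch_size=2)
    stored = conn.execute("SELECT COUNT(*) FROM spans").fetchone()[0]
    conn.close()
    return inserted, batches, stored
```

Calling `asyncio.run(run_demo())` drains five queued spans in three batches. Batching trades a little latency for far fewer database round trips, which is why a queue sits between decoding and persistence.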

Deployment and operations topology

graph LR
  subgraph Sources
    Py[Python apps]
    TS[TypeScript apps]
    Agents[Coding agents]
  end
  subgraph PhoenixPod
    Server[Phoenix web and API port 6006]
    Collector[OTLP gRPC port 4317]
    Sandbox[Optional evaluator sandbox]
  end
  subgraph Persistence
    SQLite[(SQLite volume)]
    PG[(PostgreSQL)]
  end
  subgraph Ops
    Prom[Prometheus scrape]
    OTelCollector[External OTLP collector]
    SMTP[SMTP]
    IdP[OAuth2 OIDC or LDAP]
  end
  Py --> Collector
  TS --> Collector
  Agents --> Server
  Server --> SQLite
  Server --> PG
  Collector --> Server
  Server --> Prom
  Server --> OTelCollector
  Server --> SMTP
  Server --> IdP
  Sandbox --> Server

The simplest Compose topology runs two services, phoenix and db, maps 6006:6006 and 4317:4317, and sets PHOENIX_SQL_DATABASE_URL=postgresql://postgres:postgres@db:5432/postgres. The Kubernetes options are more extensive: helm/README.md and helm/values.yaml expose authentication, CORS, CSRF trusted origins, brute-force login protection, OAuth2/OIDC, LDAP, PostgreSQL settings, read replicas, retention policy, health checks, server host/port/root URL, PHOENIX_MAX_SPANS_QUEUE_SIZE, OTLP instrumentation export endpoints, SMTP, TLS, and sandbox provider allowlists. kustomize/base/phoenix.yaml includes Prometheus scrape annotations and readiness probes.

Lifecycle and decision diagram

stateDiagram-v2
  [*] --> Capture
  Capture --> StoreTrace
  StoreTrace --> Inspect
  Inspect --> DatasetDecision
  DatasetDecision --> AddToDataset: useful example
  DatasetDecision --> DebugOnly: one-off issue
  AddToDataset --> RunExperiment
  RunExperiment --> Evaluate
  Evaluate --> Compare
  Compare --> PromotePrompt: quality improved
  Compare --> RevisePrompt: regression found
  PromotePrompt --> Capture
  RevisePrompt --> RunExperiment
  DebugOnly --> [*]

Phoenix is strongest when traces are not the end of the workflow. A production trace can become a dataset example, a dataset can drive experiments, an experiment can be scored by evaluators, and scored outputs can guide prompt or model changes. The repo reflects that loop through src/phoenix/trace/, src/phoenix/datasets/, GraphQL dataloaders for experiment and dataset state, and packages/phoenix-evals/.
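The lifecycle in the state diagram can be written down as a small transition table, which is handy when checking that a team's own workflow does not skip a step. The state names come from the diagram; the `TRANSITIONS` table and the `is_valid_path` helper are hypothetical and not part of Phoenix.

```python
from typing import Dict, Sequence, Set

# Allowed next states, copied from the state diagram above.
TRANSITIONS: Dict[str, Set[str]] = {
    "Capture": {"StoreTrace"},
    "StoreTrace": {"Inspect"},
    "Inspect": {"DatasetDecision"},
    "DatasetDecision": {"AddToDataset", "DebugOnly"},
    "AddToDataset": {"RunExperiment"},
    "RunExperiment": {"Evaluate"},
    "Evaluate": {"Compare"},
    "Compare": {"PromotePrompt", "RevisePrompt"},
    "PromotePrompt": {"Capture"},
    "RevisePrompt": {"RunExperiment"},
    "DebugOnly": set(),  # terminal state
}


def is_valid_path(path: Sequence[str]) -> bool:
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in TRANSITIONS.get(current, set())
        for current, nxt in zip(path, path[1:])
    )


regression_loop = [
    "Capture", "StoreTrace", "Inspect", "DatasetDecision", "AddToDataset",
    "RunExperiment", "Evaluate", "Compare", "RevisePrompt", "RunExperiment",
]
```

Here `is_valid_path(regression_loop)` holds, while a path that jumps from Capture straight to Evaluate does not, because evaluation is only reachable through a dataset and an experiment.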

Extension points

Integrations

The README lists broad framework and provider support through OpenInference: OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, Mastra, CrewAI, LlamaIndex, DSPy, OpenAI, Anthropic, Google GenAI, Google ADK, Bedrock, OpenRouter, LiteLLM, and more. Phoenix also provides Python subpackages for OTel, client, and evals; TypeScript packages for OTel, client, evals, MCP, and CLI; and coding-agent skills in .agents/skills/.

The package design shows a deliberate split: the server owns storage and UI, phoenix-otel owns instrumentation defaults, phoenix-client owns API interaction, phoenix-evals owns metric execution, and phoenix-mcp/phoenix-cli expose Phoenix data to agent workflows.

Configuration, deployment, and operations

Important configuration families:

Operationally, the highest-risk capacity setting is PHOENIX_MAX_SPANS_QUEUE_SIZE; Helm notes that queued spans consume memory and should be sized against database throughput. Watch database disk usage, migration status, span queue rejections, experiment runner health, sandbox availability, and auth configuration drift.
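A back-of-the-envelope calculation helps when sizing the span queue against memory and database throughput. The helper names and every number below (queue size, average span size, insert rate) are hypothetical placeholders, not Phoenix defaults or measurements; substitute values observed in your own deployment.

```python
def queue_memory_mib(max_queue_size: int, avg_span_bytes: int) -> float:
    """Worst-case memory held by a full queue, in MiB."""
    return max_queue_size * avg_span_bytes / (1024 * 1024)


def drain_seconds(max_queue_size: int, db_spans_per_second: float) -> float:
    """Time to empty a full queue if ingestion stopped and the database kept up."""
    return max_queue_size / db_spans_per_second


# Hypothetical inputs: 20,000 queued spans of about 4 KiB each,
# and a database that sustains 2,000 span inserts per second.
memory = queue_memory_mib(20_000, 4096)   # 78.125 MiB
backlog = drain_seconds(20_000, 2_000)    # 10.0 seconds
```

If the drain time exceeds the burst length the system is expected to absorb, raising the queue size only delays rejections while consuming more memory; the database write path is the thing to scale.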

Observability, testing, evaluation, and failure modes

Phoenix includes unit, integration, and package tests across tests/, packages/phoenix-client/tests/, packages/phoenix-evals/tests/, and js/packages/*/test/. Test names cover client resources for traces/spans/sessions/datasets/experiments, evaluator prompts and adapters, rate limiters, concurrency controllers, OTel registration, MCP trace/span/project/dataset utilities, and ATIF trajectory conversion.

Failure modes to design for:

Security and governance risks

Phoenix stores AI telemetry, evaluator outputs, prompt versions, dataset examples, annotations, and provider configuration. Security controls should include authentication in shared environments, SSO/OIDC or LDAP where appropriate, strong PHOENIX_SECRET, TLS for HTTP and gRPC, database network isolation, encrypted secrets, redaction rules, trace retention, least-privilege database credentials, and sandbox restrictions for code evaluators.
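As an illustration of what a redaction rule can look like, the sketch below masks attributes before they are stored or exported. It is a generic example, not Phoenix's redaction implementation: the `redact` function, the key pattern, and the email pattern are assumptions chosen for the demonstration, and real rules should be driven by the organisation's data classification.

```python
import re
from typing import Any, Dict

# Hypothetical rules: mask values whose key looks like a credential,
# and scrub email addresses from free-text values.
SENSITIVE_KEY = re.compile(r"(api[_-]?key|authorization|password|secret)", re.IGNORECASE)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of flat span attributes with sensitive content masked."""
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        if SENSITIVE_KEY.search(key):
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, str):
            cleaned[key] = EMAIL.sub("[EMAIL]", value)
        else:
            cleaned[key] = value  # numbers, booleans, etc. pass through unchanged
    return cleaned


example = redact({
    "http.request.header.authorization": "Bearer abc",
    "input.value": "mail me at jo@example.com",
    "llm.token_count.prompt": 12,
})
```

Applying redaction as close to the source as possible (in the instrumented application or a collector) keeps sensitive values out of queues, logs, and backups as well as out of the database.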

Governance teams should treat datasets and experiments as quality records. If evaluator definitions change, results should be interpreted by evaluator version. If prompt versions are promoted from playground work, teams should preserve trace links and experiment evidence.
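Interpreting results by evaluator version amounts to grouping scores on the pair (evaluator, version) before aggregating. The sketch below assumes a hypothetical record shape with `evaluator`, `version`, and `score` fields; it does not reflect Phoenix's actual annotation schema, and the scores are invented.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_by_version(results: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Average scores separately for each (evaluator, version) pair."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for record in results:
        grouped[(record["evaluator"], record["version"])].append(record["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


# Invented records: the same evaluator before and after a definition change.
records = [
    {"evaluator": "relevance", "version": "v1", "score": 0.5},
    {"evaluator": "relevance", "version": "v1", "score": 1.0},
    {"evaluator": "relevance", "version": "v2", "score": 0.25},
]
summary = mean_score_by_version(records)
```

Pooling the v1 and v2 scores into one average would make a change in the evaluator look like a change in model quality, which is exactly the confusion versioned interpretation avoids.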

Reading guide

  1. Read README.md for the product workflow and integrations.
  2. Read pyproject.toml for dependencies, package entrypoints, and optional extras.
  3. Read src/phoenix/server/app.py to understand the application composition root.
  4. Read src/phoenix/trace/attributes.py, schemas.py, and trace_dataset.py for trace representation.
  5. Read src/phoenix/db/, especially models, migrations, bulk inserter, and facilitator.
  6. Read packages/phoenix-otel/, packages/phoenix-client/, and packages/phoenix-evals/ for external SDK roles.
  7. Read helm/README.md, helm/values.yaml, and kustomize/base/phoenix.yaml for production configuration.
  8. Use tests in packages/ and js/packages/ to understand edge cases.

Learning path

  1. Start a mental trace from OpenInference instrumentation to OTLP gRPC ingestion.
  2. Follow how a span becomes a database record and then a GraphQL UI query.
  3. Learn dataset and experiment concepts from README, src/phoenix/datasets/, and dataloaders.
  4. Study packages/phoenix-evals to understand LLM judge and classification evaluator construction.
  5. Study js/packages/phoenix-mcp if you want Phoenix data available to coding agents.
  6. Review Helm security settings before using Phoenix outside a local environment.

Glossary

Repository-Grounded Deep Dive

Phoenix is best understood as an OpenTelemetry-native LLMOps application with a Python server, a TypeScript UI, and separate evaluator/client packages. The core runtime lives under src/phoenix/, with important boundaries in server/, trace/, db/, datasets/, and metrics/. The UI is under app/src/. Evaluator packages live in packages/phoenix-evals/, and the JavaScript client and OTel packages live under js/packages/phoenix-client/ and js/packages/phoenix-otel/. Deployment examples are visible in kustomize/, helm/, and scripts/docker/devops/.

flowchart LR
  App["LLM app with OpenInference instrumentation"] --> OTLP["OTLP HTTP or gRPC ingestion"]
  OTLP --> Server["Phoenix server src/phoenix/server"]
  Server --> Trace["trace normalization src/phoenix/trace"]
  Trace --> DB["database layer src/phoenix/db"]
  DB --> GraphQL["GraphQL resolvers and dataloaders"]
  GraphQL --> UI["React UI app/src"]
  DB --> Datasets["datasets and experiments"]
  EvalPkg["packages/phoenix-evals"] --> Datasets
  ClientPkg["js/packages/phoenix-client"] --> Server

The architectural split matters because Phoenix is both an ingestion system and an analysis workbench. OTLP/OpenInference spans are the raw evidence. GraphQL and DataLoader logic shape that evidence for UI exploration. Datasets, experiments, annotations, prompt versions, and evaluators convert observations into regression assets. If the team only monitors span ingestion, it can miss the health of experiment comparison, evaluator execution, prompt release, or dataset mutation paths.
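The DataLoader pattern mentioned above collects individual key lookups issued in the same event-loop tick and resolves them with one batched call. The sketch below is a minimal asyncio version of that general pattern, not Phoenix's or Strawberry's implementation; `BatchLoader`, `fetch_span_counts`, and `run_demo` are hypothetical names, and the "database" is a stub that returns the length of each key.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Hashable, List, Optional


class BatchLoader:
    """Coalesce load(key) calls made in one tick into a single batch_fn call."""

    def __init__(self, batch_fn: Callable[[List[Hashable]], Awaitable[List[Any]]]) -> None:
        self._batch_fn = batch_fn
        self._pending: Dict[Hashable, "asyncio.Future[Any]"] = {}
        self._task: Optional["asyncio.Task[None]"] = None

    def load(self, key: Hashable) -> "asyncio.Future[Any]":
        loop = asyncio.get_running_loop()
        if key not in self._pending:
            self._pending[key] = loop.create_future()
        if self._task is None:
            # The task body runs only after the current coroutine yields,
            # so every load() made before that point joins the same batch.
            self._task = loop.create_task(self._dispatch())
        return self._pending[key]

    async def _dispatch(self) -> None:
        pending, self._pending = self._pending, {}
        self._task = None
        keys = list(pending)
        try:
            values = await self._batch_fn(keys)
        except Exception as exc:
            for future in pending.values():
                future.set_exception(exc)
            return
        for key, value in zip(keys, values):
            pending[key].set_result(value)


async def run_demo():
    calls: List[List[Hashable]] = []

    async def fetch_span_counts(trace_ids: List[Hashable]) -> List[int]:
        calls.append(list(trace_ids))  # record each "query" sent to the stub database
        return [len(str(trace_id)) for trace_id in trace_ids]

    loader = BatchLoader(fetch_span_counts)
    results = await asyncio.gather(loader.load("a"), loader.load("bb"), loader.load("a"))
    return results, calls
```

In the demo, three lookups (one of them a duplicate) result in a single batched call with two distinct keys. The same mechanism explains the "hot query" risk: batching hides how many keys a resolver requests, so an expensive access pattern can show up only as one slow batched query.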

sequenceDiagram
  participant Inst as Instrumented app
  participant OTLP as Phoenix OTLP endpoint
  participant Srv as Phoenix server
  participant DB as DB session and migrations
  participant UI as GraphQL UI
  participant Eval as Evaluator package
  Inst->>OTLP: export spans with OpenInference attributes
  OTLP->>Srv: parse and normalize spans
  Srv->>DB: persist traces, spans, annotations
  UI->>DB: batched GraphQL reads through dataloaders
  UI->>Eval: request experiment or evaluator workflow
  Eval->>Srv: submit scores or experiment results
  Srv->>DB: attach evaluations to datasets or spans

flowchart TD
  Risk["Operational risk"] --> Ingest["OTLP schema mismatch"]
  Risk --> DB["database migration drift"]
  Risk --> Eval["LLM judge nondeterminism"]
  Risk --> Sandbox["code evaluator sandbox"]
  Risk --> UI["GraphQL hot query"]
  Risk --> Auth["auth and reverse proxy"]
  Ingest --> I1["span attributes fail downstream filters"]
  DB --> D1["experiments or annotations cannot load"]
  Eval --> E1["score distributions shift after prompt/model change"]
  Sandbox --> S1["untrusted code needs isolation and limits"]
  UI --> U1["dataloader batching hides expensive access pattern"]
  Auth --> A1["headers or base path break behind proxy"]

Production Readiness Checklist

Senior Architect Reading Path

Read src/phoenix/server/ first to understand ingestion and API boundaries. Then read src/phoenix/trace/ and src/phoenix/db/ to see how spans become queryable records. Move to app/src/ for the UI concepts, especially trace and experiment screens. Finally, inspect packages/phoenix-evals/, packages/phoenix-otel/, js/packages/phoenix-client/, and the deployment manifests. This order keeps telemetry, persistence, product workflow, SDK behavior, and operations separate.

Operational Scenarios to Rehearse

Run one scenario for each Phoenix plane. For ingestion, export traces from a framework integration and confirm OpenInference attributes survive OTLP parsing, persistence, GraphQL reads, and UI rendering. For evaluation, create a dataset, run an experiment with a pinned evaluator, and verify that annotations and scores remain comparable after a server restart. For operations, place Phoenix behind the reverse-proxy examples, then check base URL handling, auth headers, TLS termination, and large trace pagination.