Guide AI agents ~13 min read Updated June 26, 2026

LLM Observability Explained: What to Instrument, What to Ignore, and How to Start

LLM observability instruments AI agent applications to capture the signals that explain failures: token usage, latency, tool-call error rates, and retry counts. A model call can return HTTP 200 with valid JSON and still represent a failure, and traditional monitoring never sees it.

TL;DR

20 seconds. LLM observability exposes four behavioral signals that traditional APM misses: tool-call error rate, p95 latency, token spend per task, and retry count. A request that returns 200 OK can hide a retry loop consuming 3–5x the expected token budget, a malformed output that passes schema validation, or a plan revision that burns tokens silently.

60 seconds. Production LLM observability requires two distinct layers: the model call layer (Langfuse, Phoenix, Helicone) captures what the model did and what it cost; the application layer captures what broke in the code around it. The OpenTelemetry GenAI semantic conventions standardize model call attributes, including token counts, finish reasons, tool call names, and agent identity, but they explicitly exclude application exceptions, eval scores, and deployment regression tracking. Of the four behavioral signals, token spend per task is the one most likely to surface a looping agent before latency or error rates react: alert at baseline times 1.5. urgentry covers the application layer, accepting OTLP/HTTP on the same Collector pipeline that already routes GenAI spans to Langfuse or Phoenix, with no separate instrumentation pass required.

This guide covers why LLM observability is not just observability with a new label, the two-layer architecture, the four signals worth instrumenting, what to ignore, the OTel GenAI gap analysis, MCP servers as a specific visibility gap, how to start, and where urgentry fits.

Why LLM observability is not just observability with a new label

Traditional observability answers three questions: did the service return the right status code, how long did it take, and which component failed? Those questions assume deterministic code. Given identical inputs, a function returns identical outputs, and a failure surfaces as an exception or a non-200 response.

LLM-powered systems break that assumption at every layer.

The model is a probabilistic component. The same prompt can produce different outputs across calls. A failure may never raise an exception; it may return a JSON object with subtly wrong values that passes schema validation and propagates silently into a database write. Tool calls made by an agent can fail, trigger automatic retries, and burn 3–5x the expected token budget while the system reports 200 OK throughout.

Five failure modes appear consistently in production AI agents, none of which traditional APM captures:

  1. Malformed model output causes an internal parse failure without surfacing as an exception. The agent catches the JSON decode error, returns a sensible-looking default, and the calling code never knows.
  2. Tool calls returning error statuses trigger automatic retries. Each retry consumes tokens. The system completes successfully with a 3x token bill.
  3. Broken plan revisions occur when the agent repeats a failed sequence with minor variations, consuming tokens in a loop before either giving up or producing degraded output.
  4. Cost overruns consume 3–5x the expected token budget on a single task. Nothing in the response signals this unless you are tracking per-task token spend.
  5. Silent low-quality output completes with no exception and no error status, but produces an incorrect result.

Failure modes 1 through 4 are detectable with the right instrumentation. Failure mode 5 requires an evaluation layer that sits outside the observability stack entirely, a distinction covered below.

The two-layer architecture

LLM observability in production requires two distinct instrumentation layers. Conflating them is the most common instrumentation mistake, and it leads teams to instrument heavily, gain little, and conclude that LLM observability is harder than it actually is.

Layer 1: the model call layer. This is what tools like Langfuse, Phoenix (Arize), and Helicone handle well. It captures what the model did: the prompt sent, the completion received, token counts via gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, latency per call, the model identifier from gen_ai.request.model, and the finish reason from gen_ai.response.finish_reasons. The OpenTelemetry GenAI semantic conventions, now maintained at github.com/open-telemetry/semantic-conventions-genai, define the standard attribute names for this layer, with provider-specific coverage for OpenAI, Anthropic, AWS Bedrock, and Azure AI Inference.

Layer 2: the application layer. This is what the model call layer does not cover: unhandled exceptions in the code around the model call, grouped error fingerprinting, request context, and deployment-level regressions. An urgentry guide on self-hosted LLM observability states the distinction directly: "LLM observability tools answer one question well: what did the model do, and what did it cost? They answer a second question badly or not at all: what broke in the code around the model call?" The guide on tracking AI agent errors covers the application-layer patterns in detail.

The two layers connect through W3C traceparent headers. When an exception fires in the application layer, the active span ID links the error event to the corresponding model trace in the LLM observability tool. Without both layers instrumented and passing that header, cross-platform navigation from an urgentry error to its Langfuse trace is not possible.
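To make the linkage concrete, here is a minimal sketch of pulling the trace ID and span ID out of a W3C traceparent header using only the standard library. The names parse_traceparent and TraceContext are illustrative and not part of any SDK; in a real service the OpenTelemetry propagators normally do this for you.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the IDs carried by a version-00 traceparent header, or None if invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid under the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The span_id field is the value an error event would carry to point back at the model trace; if either side drops the header, there is nothing to join on.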

The four signals worth instrumenting

Of the signals available to you, four carry most of the diagnostic weight for agent failures. The rest is noise at varying price points.

Tool-call error rate. This is the fastest indicator of agent degradation. Segment it by tool name. An error rate spike on one specific tool (a database lookup, an API call, a file read) isolates the failure before latency and cost metrics register a change. Urgentry’s degradation guide puts the alert threshold at baseline plus two standard deviations, evaluated over a rolling window.
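As a rough sketch of that threshold, the function below flags any tool whose current error rate exceeds the mean of its rolling window plus two standard deviations. The function name, the dict shapes, the sample rates, and the choice of population standard deviation are all assumptions made for illustration, not the method from the degradation guide.

```python
from statistics import mean, pstdev
from typing import Dict, List


def error_rate_alerts(
    history: Dict[str, List[float]],
    current: Dict[str, float],
    sigmas: float = 2.0,
) -> Dict[str, float]:
    """Return {tool_name: threshold} for tools whose current error rate
    exceeds baseline mean + sigmas * standard deviation of the window."""
    breached: Dict[str, float] = {}
    for tool, window in history.items():
        if len(window) < 2 or tool not in current:
            continue  # not enough history to estimate a baseline
        threshold = mean(window) + sigmas * pstdev(window)
        if current[tool] > threshold:
            breached[tool] = threshold
    return breached


# Hypothetical per-interval error rates, segmented by tool name.
history = {
    "db_lookup": [0.01, 0.02, 0.01, 0.02],
    "file_read": [0.05, 0.04, 0.06, 0.05],
}
current = {"db_lookup": 0.09, "file_read": 0.05}
alerts = error_rate_alerts(history, current)
```

Here only db_lookup breaches its threshold, which is the point of segmenting by tool: a single aggregate error rate would dilute the spike across every tool.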

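End-to-end latency at p95. Mean latency hides the failure mode where an agent completes 90% of tasks in two seconds but spends 45 seconds on the remaining 10% because it entered a retry loop: the mean comes out near six seconds and looks merely sluggish, while p95 reports the full 45 seconds. If the slow tail is thinner than 5% of tasks, p95 will miss it too, so track p99 alongside. Alert at 30 seconds p95 for most agent task types; adjust for your workload’s expected duration.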
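The gap between mean and tail is easy to check by hand. The sketch below uses a nearest-rank percentile and an invented sample in which 10% of tasks hit a retry loop; both the sample and the percentile definition are assumptions, and your metrics backend may interpolate differently.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast tasks at 2 s, 10 tasks stuck in a retry loop at 45 s.
latencies = [2.0] * 90 + [45.0] * 10

mean_latency = mean(latencies)
p95_latency = percentile(latencies, 95)
breaches_alert = p95_latency > 30.0  # the 30-second p95 threshold from the text
```

With this sample the mean is 6.3 seconds while p95 is 45 seconds. Note the boundary case: under the same nearest-rank definition, a sample with exactly 5% slow tasks still reports a p95 of 2 seconds, so a thin tail may call for p99 as well.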

Token spend per task. Track this as a per-transaction budget, not a raw volume aggregate. A task that consumed 800 input tokens at baseline but now consistently consumes 3,200 is very likely looping. The metric is assembled from the gen_ai.usage.input_tokens and gen_ai.usage.output_tokens attributes on model call spans, summed across all spans in a task trace. One implementation detail matters: write the token counts to the span as soon as each model call returns, rather than deferring them to the end of the task. If they are deferred, they are missing from traces in which a later step times out or throws, which is precisely when you need them. Alert at baseline times 1.5.
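A minimal sketch of the aggregation, assuming spans have already been exported as plain dicts with a trace_id and an attributes map. That dict shape, the function names, and the sample numbers are hypothetical; only the two gen_ai.usage.* attribute names come from the conventions.

```python
from typing import Any, Dict, Iterable, List, Mapping

INPUT_KEY = "gen_ai.usage.input_tokens"
OUTPUT_KEY = "gen_ai.usage.output_tokens"


def tokens_per_task(spans: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Sum input and output tokens across every span sharing a trace_id."""
    totals: Dict[str, int] = {}
    for span in spans:
        attrs = span.get("attributes", {})
        # A missing attribute counts as zero rather than failing the whole sum.
        spent = int(attrs.get(INPUT_KEY, 0)) + int(attrs.get(OUTPUT_KEY, 0))
        totals[span["trace_id"]] = totals.get(span["trace_id"], 0) + spent
    return totals


def over_budget(totals: Mapping[str, int], baseline: int, factor: float = 1.5) -> List[str]:
    """Return trace IDs whose token spend exceeds baseline * factor."""
    return [trace_id for trace_id, spent in totals.items() if spent > baseline * factor]


# Hypothetical exported spans: task-b made three model calls, task-a made one.
spans = [
    {"trace_id": "task-a", "attributes": {INPUT_KEY: 800, OUTPUT_KEY: 200}},
    {"trace_id": "task-b", "attributes": {INPUT_KEY: 800, OUTPUT_KEY: 200}},
    {"trace_id": "task-b", "attributes": {INPUT_KEY: 1600, OUTPUT_KEY: 300}},
    {"trace_id": "task-b", "attributes": {INPUT_KEY: 800}},  # output count missing
]
totals = tokens_per_task(spans)
flagged = over_budget(totals, baseline=1000)
```

Grouping by trace before comparing against the budget is what separates a looping task from a busy hour: total volume can stay flat while one task quietly spends several times its share.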

Retry count per task. Store this as a span attribute (agent.retry_count) on the task-level parent span. A task that succeeds after four retries has a different cost and reliability story than one that succeeds on the first attempt. Aggregating them without this attribute makes both invisible. This attribute also catches failure mode 3 (broken plan revisions) before it becomes a cost overrun.

These four signals are behavioral: they measure what the agent did, independent of whether the output was correct. The guide on measuring agent degradation with OTel covers the span hierarchy and attribute-setting patterns in detail.

What to ignore

The instrumentation decisions that consume the most engineering time often produce the least diagnostic signal.

Raw prompt and completion text. Storing full prompt and completion content in your observability backend creates a large PII surface, inflates storage costs, and adds minimal debugging value beyond what structured attributes provide. The signal you actually need is available without the text: token counts tell you cost, finish reasons tell you whether the model stopped normally or hit a limit, and schema validation results tell you whether the output was parseable. Capture structural metadata instead: string length, schema validation result, argument hashes for repetition detection.
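One way to capture that structural metadata is sketched below: a length, a validation flag, and a stable hash of the tool arguments so that repeated calls can be spotted without storing the values. The app.* attribute names and both function names are made up for this example, and the 16-character truncation is an arbitrary choice.

```python
import hashlib
import json
from typing import Any, Dict


def argument_fingerprint(arguments: Dict[str, Any]) -> str:
    """Stable digest of tool arguments; key order does not affect the result."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def structural_metadata(completion: str, schema_valid: bool, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Span attributes that describe an output without storing its text."""
    return {
        "app.completion.length": len(completion),       # hypothetical attribute name
        "app.schema.valid": schema_valid,                # hypothetical attribute name
        "app.tool.args_hash": argument_fingerprint(arguments),
    }


first = structural_metadata('{"id": 7}', True, {"user_id": 7, "table": "orders"})
repeat = structural_metadata('{"id": 7}', True, {"table": "orders", "user_id": 7})
is_repeat = first["app.tool.args_hash"] == repeat["app.tool.args_hash"]
```

Identical hashes on consecutive tool calls within one task are a cheap repetition signal. A plain hash of a low-entropy value can still be guessed by brute force, so treat it as a way to shrink the PII surface, not as anonymization.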

Inline production evals as a primary alerting mechanism. Running automated eval scores against every production request is computationally expensive, introduces latency into the request path, and produces results that are only as reliable as your test set and judge model. Eval scores belong in a separate evaluation pipeline with dedicated infrastructure. The four behavioral signals above are your leading indicators that something has changed; evals confirm whether it matters and belong in the investigation workflow, not the alert path.

Span trees without exception capture. Framework auto-instrumentation for LangChain, LlamaIndex, and similar libraries generates span trees that show what the agent called and in what sequence. That data is useful. It becomes significantly more useful when spans include recorded exceptions: not just the fact that a tool call failed, but the stack trace that explains why. The span.record_exception(exc) pattern, applied inside tool handlers before re-raising or returning a default, converts a timing trace into a debugging artifact. Without it, you have a timeline with no causes.
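The pattern can be sketched for the "return a default" case, where nothing propagates and the span is the only place the failure can leave evidence. To keep the example dependency-free, RecordingSpan is a stand-in that mimics the two OpenTelemetry Span methods used here; parse_tool_output and the app.used_default attribute are hypothetical names.

```python
import json
from typing import Any, Dict, List, Protocol


class SpanLike(Protocol):
    """The subset of the OpenTelemetry Span interface this sketch relies on."""

    def set_attribute(self, key: str, value: Any) -> None: ...
    def record_exception(self, exception: BaseException) -> None: ...


class RecordingSpan:
    """Stand-in span for local runs; pass a real OpenTelemetry span in production."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}
        self.exceptions: List[BaseException] = []

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def record_exception(self, exception: BaseException) -> None:
        self.exceptions.append(exception)


def parse_tool_output(raw: str, span: SpanLike) -> Any:
    """Parse model output, falling back to a default but leaving evidence on the span."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        span.record_exception(exc)                    # keeps the cause of the failure
        span.set_attribute("app.used_default", True)  # hypothetical attribute name
        return {}


span = RecordingSpan()
result = parse_tool_output("{not json", span)
```

The caller still receives an empty dict, so the control flow is unchanged, but the trace now shows that a default was substituted and why.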

AI-authored code deserves specific attention here. Urgentry’s guide on swallowed exceptions in agent-generated backends documents four antipatterns that AI code disproportionately produces: empty except blocks that discard errors silently, bare except Exception blocks that return sensible-looking defaults, retry decorators that return None after exhaustion without alerting, and handlers that return HTTP 200 despite catching exceptions. All four are invisible to status-code-based monitoring and all four are correctable with exception capture inside the handler.

The OTel GenAI gap analysis

The OpenTelemetry GenAI semantic conventions, as of mid-2026 maintained at github.com/open-telemetry/semantic-conventions-genai, define the standard vocabulary for the model call layer. The attribute set covers: request parameters (gen_ai.request.model, gen_ai.request.max_tokens, gen_ai.request.temperature, gen_ai.request.top_p), response metadata (gen_ai.response.model, gen_ai.response.finish_reasons, gen_ai.response.id), token usage (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens), tool calls (gen_ai.tool.name), agent identity (gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description), conversation tracking (gen_ai.conversation.id, gen_ai.workflow.name), and retrieval (gen_ai.retrieval.documents). Provider coverage includes OpenAI, Anthropic, AWS Bedrock, and Azure AI Inference, plus MCP.

The spec is actively evolving. As of 2026-06-26 the repository had 119 open issues and 32 open pull requests, which is the signature of an engaged working group, not a neglected spec.

What the GenAI conventions do not cover, and are not designed to cover:

Application exceptions. Stack traces, exception types, breadcrumbs, and the grouping and fingerprinting logic that converts a stream of individual exception events into actionable issues. This is the domain of Sentry-protocol error trackers. The GenAI spec defines no attributes for exception message, traceback, or error fingerprint.

Eval scores. The conventions include no standard attribute for human ratings or automated evaluation output. These are application-specific values with no standardized schema.

Plan-level reasoning failures. An agent that silently revises its plan after a tool failure is not detectable from the span tree unless the revision is explicitly instrumented as an attribute. The spec has no standard attribute for plan retry events or reasoning chain failures.

Deployment regression tracking. Comparing error rates and token spend across releases requires deploy tagging, which is outside the scope of the GenAI conventions but standard in error-tracking tools.

Prompt engineering state. Which prompt template version produced which completion is not tracked by the spec. Teams that run A/B tests on prompt wording need to add this as a custom attribute.

The practical consequence: teams that instrument only the GenAI layer have full visibility into model calls and zero visibility into what breaks in the code around them. The relationship between these signal types is covered in the logs vs traces vs errors guide.

MCP servers: a specific visibility gap

MCP (Model Context Protocol) servers require separate mention because they introduce an architectural visibility constraint that the OTel GenAI conventions do not resolve.

Neither the Python nor the TypeScript MCP SDK ships with built-in OpenTelemetry support as of mid-2026. Instrumentation requires wrapping tool handlers manually: start a span when the handler begins, set attributes for the tool name and structural metadata about the arguments (not the values themselves, to avoid PII), and call span.record_exception(exc) before ending the span. Configure OTLP export to port 4318 via environment variables.
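The wrapping steps above might look roughly like the decorator below. It is a sketch under several assumptions: FakeTracer and FakeSpan stand in for an OpenTelemetry tracer and span so the example runs without dependencies, the handler is synchronous and keyword-only (real MCP handlers are often async), the span name and the app.tool.* attribute names are illustrative, and nothing here uses the MCP SDK's own API.

```python
import functools
from typing import Any, Callable, Dict, List


class FakeSpan:
    """Stand-in exposing the OpenTelemetry Span methods used below."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: Dict[str, Any] = {}
        self.exceptions: List[BaseException] = []
        self.ended = False

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def record_exception(self, exception: BaseException) -> None:
        self.exceptions.append(exception)

    def end(self) -> None:
        self.ended = True


class FakeTracer:
    """Stand-in for an OpenTelemetry Tracer; keeps spans for inspection."""

    def __init__(self) -> None:
        self.spans: List[FakeSpan] = []

    def start_span(self, name: str) -> FakeSpan:
        span = FakeSpan(name)
        self.spans.append(span)
        return span


def traced_tool(tracer: Any, tool_name: str) -> Callable:
    """Wrap a tool handler: one span per call, argument shape only, exception recorded."""

    def decorator(handler: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(handler)
        def wrapper(**arguments: Any) -> Any:
            span = tracer.start_span(f"execute_tool {tool_name}")
            span.set_attribute("gen_ai.tool.name", tool_name)
            # Record argument names and count, never the values (possible PII).
            span.set_attribute("app.tool.arg_names", sorted(arguments))
            span.set_attribute("app.tool.arg_count", len(arguments))
            try:
                return handler(**arguments)
            except Exception as exc:
                span.record_exception(exc)
                raise
            finally:
                span.end()  # runs on both the success and the failure path

        return wrapper

    return decorator


tracer = FakeTracer()


@traced_tool(tracer, "read_file")
def read_file(path: str) -> str:
    raise FileNotFoundError(path)
```

With the real SDK, the export target is usually set through the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable (for example http://localhost:4318), and the span would also need to be made current for child spans to nest under it, which this sketch leaves out.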

The architectural constraint: the MCP server sees the tool call but not the prompt that produced it. The agent host knows the prompt context but not the internal execution details of the tool. Complete observability requires instrumentation at both layers, with the W3C traceparent header linking them. A span at the MCP server without a corresponding trace at the agent host tells you that a tool ran and how long it took. It does not tell you what the agent was trying to accomplish when it called the tool, or whether the call was a first attempt or a fourth retry.

The MCP server observability guide covers the wrapping pattern and the argument-metadata approach in Python and TypeScript.

How to start

Three steps, in sequence.

  1. Instrument the model call layer. Use the auto-instrumentation package for your framework (opentelemetry-instrumentation-langchain for LangChain; equivalent packages exist for LlamaIndex, the OpenAI SDK, and the Anthropic SDK). Configure the OTLP exporter to your chosen backend on port 4318. Set gen_ai.usage.input_tokens and gen_ai.usage.output_tokens after the model call returns. This gives you token cost and latency data within a single session.
  2. Add exception capture inside tool handlers. Wrap each tool function with a span. Call span.record_exception(exc) inside every except block before re-raising or returning a default. If you are routing application exceptions to an error tracker alongside the model call layer, wire the Sentry SDK at this step: it routes exceptions to a grouped issues view rather than a raw trace attribute, and the traceparent header connects the two.
  3. Add behavioral span attributes on the task-level parent span. Add agent.retry_count, agent.tool_error_count, and a total token budget attribute. Set them incrementally as the task executes, not only at the end, so they appear on spans that time out or throw before completion. These attributes produce the four behavioral signals described above.
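Step 3 can be sketched as a small counter object that mirrors every change onto the parent span immediately, so the values survive a timeout partway through the task. RecordingSpan again stands in for an OpenTelemetry span; TaskCounters, run_with_retries, and the agent.tokens_total attribute name are assumptions made for this example.

```python
from typing import Any, Callable, Dict


class RecordingSpan:
    """Stand-in for the task-level parent span (an OpenTelemetry Span in production)."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


class TaskCounters:
    """Keeps behavioral counters and writes them to the span on every change."""

    def __init__(self, span: Any) -> None:
        self._span = span
        self._values = {
            "agent.retry_count": 0,
            "agent.tool_error_count": 0,
            "agent.tokens_total": 0,  # hypothetical name for the token budget attribute
        }
        for key, value in self._values.items():
            span.set_attribute(key, value)

    def add(self, key: str, amount: int = 1) -> None:
        self._values[key] += amount
        self._span.set_attribute(key, self._values[key])  # written immediately


def run_with_retries(tool: Callable[[], str], counters: TaskCounters, max_attempts: int = 3) -> str:
    """Call tool up to max_attempts times, counting retries and errors as they happen."""
    for attempt in range(max_attempts):
        if attempt > 0:
            counters.add("agent.retry_count")
        try:
            return tool()
        except Exception:
            # Real code would also call span.record_exception(exc) here.
            counters.add("agent.tool_error_count")
    raise RuntimeError("tool failed after retries")


span = RecordingSpan()
counters = TaskCounters(span)
attempts = iter([TimeoutError("slow"), TimeoutError("slow"), "ok"])


def flaky_tool() -> str:
    outcome = next(attempts)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = run_with_retries(flaky_tool, counters)
counters.add("agent.tokens_total", 1250)
```

The task succeeds, yet the parent span records two retries and two tool errors, which is the difference between a clean first-attempt success and one that only looks clean from the status code.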

For MCP servers, the same three steps apply with the manual wrapping pattern. No changes to the MCP server binary are needed for the transport layer.

Where urgentry fits

Urgentry covers the application layer: exception capture, grouping by fingerprint, and deployment-level regression tracking for AI agent codebases. It accepts OTLP/HTTP at port 4318. The same OpenTelemetry Collector that routes GenAI spans to Langfuse or Phoenix can fan out exceptions to urgentry without a second instrumentation pass. The self-hosted LLM observability guide covers the two routing patterns: OTel Collector fan-out (single export point duplicated to both backends) and split routing (GenAI spans to Langfuse, exceptions to urgentry via Sentry SDK).

What urgentry does not replace: the model call layer. Langfuse, Phoenix, and Helicone handle that well, and none of them are redundant with error tracking. The two-layer architecture is not urgentry’s framing of the market; it reflects what different signal types are, and neither layer makes the other unnecessary.

The binary runs at roughly 52 MB peak memory in Tiny mode and handles 400 events per second on a $5–$20/month VPS. If you are already self-hosting error tracking with urgentry for a non-agent codebase, extending it to cover agent exceptions requires one additional OTLP export destination and the exception-capture patterns in Step 2 above. No separate binary, no additional service.

Frequently asked questions

What is LLM observability?

LLM observability is the practice of instrumenting large-language-model-powered applications to capture the signals that explain failures: token usage, latency, tool-call error rates, retry counts, and application exceptions. It differs from traditional observability because a model call can return HTTP 200 with valid JSON and still represent a failure, through malformed semantic output, excessive token spend, or a silent plan loop that never raises an exception.

How does LLM observability differ from traditional observability?

Traditional observability assumes deterministic code: the same input produces the same output, and failures surface as exceptions or non-200 status codes. LLM systems are probabilistic. A model call can succeed at the transport layer and still fail semantically or economically. That requires a different signal set: token spend per task, retry rate, output schema validation results, and application exceptions inside tool handlers. Standard APM captures none of these by default.

What is AI agent observability?

AI agent observability extends LLM observability to multi-step systems where a model repeatedly calls tools, revises plans, and makes decisions across a session. The additional signals: tool-call error rate segmented by tool name, retry count per task, end-to-end latency at p95, and the application exceptions that fire inside tool handlers. Agent failures typically surface through cost overruns and retry loops before they surface through error rates, which is why token spend per task and retry count are leading indicators.

What do the OpenTelemetry GenAI semantic conventions cover?

The OTel GenAI conventions, maintained at github.com/open-telemetry/semantic-conventions-genai, define standard attributes for model call spans: request parameters (gen_ai.request.model, gen_ai.request.max_tokens), response metadata, token counts (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens), tool call names (gen_ai.tool.name), agent identity (gen_ai.agent.id, gen_ai.agent.name), and conversation tracking (gen_ai.conversation.id). Provider coverage includes OpenAI, Anthropic, AWS Bedrock, and Azure AI Inference, plus MCP. The spec does not cover application exceptions, eval scores, or deployment regression tracking. As of mid-2026 the spec is actively evolving in a dedicated repository.

What should I not instrument in an LLM application?

Skip raw prompt and completion text in your observability backend; the structural metadata (token counts, finish reason, schema validation result) gives you the debugging signal without the PII risk and storage cost. Skip inline production evals as a primary alerting mechanism; they belong in a separate evaluation pipeline, not the production telemetry path. Skip span trees without exception capture inside tool handlers; framework auto-instrumentation without span.record_exception(exc) produces timing data but not debugging artifacts, and the debugging artifacts are the more useful half.

What tools are used for LLM observability?

The space splits by layer. Model call layer: Langfuse (MIT license, self-hostable on Langfuse v3.22.0+), Phoenix (open source, from Arize), Helicone (SaaS). Application error layer: Sentry-compatible error trackers including urgentry (self-hosted, single binary, Sentry API-compatible), GlitchTip, and Bugsink. The two layers share data through W3C traceparent headers and an OTLP Collector that routes spans to both destinations. Using only the model call layer leaves application exceptions invisible; using only the error layer leaves token cost and model behavior invisible.

Sources

  1. OpenTelemetry GenAI Semantic Conventions repository — active spec as of mid-2026, accessed 2026-06-26.
  2. OpenTelemetry GenAI Attributes Registry — canonical listing of standard gen_ai.* attribute names, accessed 2026-06-26.
  3. urgentry: "Self-Hosted LLM Observability" — two routing patterns for fan-out to Langfuse and urgentry from a single OTLP Collector.
  4. urgentry: "AI Agent Error Tracking" — application-layer patterns for span hierarchies, tool-handler exception capture, and fingerprinting.
  5. urgentry: "Measuring Agent Degradation with OTel" — span hierarchy, attribute-setting order, and alert threshold methodology.
  6. urgentry: "MCP Server Observability" — manual wrapping pattern for Python and TypeScript MCP SDK tool handlers.
  7. urgentry: "Agent-Introduced Bugs: Swallowed Exceptions" — four antipatterns AI-generated code produces that are invisible to status-code monitoring.
  8. urgentry: "Claude Code OTLP" — OTLP export configuration for Claude Code agent workloads, accessed 2026-06-26.
  9. urgentry: "Open-Source Datadog MCP Alternative" — self-hosted observability options for MCP server environments, accessed 2026-06-26.

Start with the application layer of LLM observability.

urgentry is a single Go binary that accepts OTLP/HTTP at port 4318 and the Sentry envelope protocol on a backend you own. It captures agent and application-layer failures with exception grouping, fingerprinting, and deployment regression tracking, running at roughly 52 MB peak memory on a $5 VPS.