Updated April 7, 2026

Is my AI agent getting worse? Measuring degradation with OpenTelemetry.

Agent degradation rarely announces itself. A model rollout changes function-calling format; tool calls start failing. A user starts injecting unusual phrases; the agent chooses wrong tools. A prompt tweak that looked fine in manual testing shifts error rates in production. This guide covers how to detect all of those shifts with OpenTelemetry spans, why telemetry alone cannot tell you whether the agent's answers got worse, and how to add the quality signal that closes the gap.

TL;DR

20 seconds. Telemetry shows agent behavior: tool-call error rate, end-to-end latency, token spend per task, retry rate. It does not show agent quality: whether the output was correct, useful, or safe. Four behavior signals are measurable from OTel spans today. Two quality signals, human ratings and automated eval scores, require separate instrumentation and must be emitted alongside span data to close the gap.

60 seconds. Instrument your agent with spans on agent.invoke, child spans on each tool call, token attributes on model calls, and ERROR status on any tool failure. Point the OTel exporter at urgentry's OTLP/HTTP endpoint on port 4318. Set four alerts: tool-call error rate above baseline plus two standard deviations, p95 latency above 30 seconds, eval-score delta above five percentage points, and token-per-task above baseline times 1.5. Run the eval suite on every deploy and emit the score as a metric. Now you have a backstop that fires before users file tickets.

This guide is honest about what observability cannot do. Telemetry tells you the agent ran and how it behaved. It does not tell you whether the agent's output was any good. The eval signal described in section six is outside the scope of what OpenTelemetry defines, and there is no consensus yet on a standard format for it. What this guide offers is a workable approach that teams are using in production today, along with a clear statement of where it ends and where human judgment begins.

What "getting worse" actually means for an agent

Degradation is not a single phenomenon. Five distinct causes produce superficially similar symptoms, and they require different fixes.

Drift in upstream model output. Model providers update models continuously. A provider may ship a new checkpoint under the same model name, or change the default sampling parameters, or alter how the model formats function call arguments. Your agent's behavior changes without any change on your side. The model still returns valid JSON, but the argument structure differs from what your tool schemas expect. Tool calls start failing at a rate that looks like a bug you introduced, but the source is upstream.

Prompt injection in user input. Users discover that certain phrases shift the agent's behavior: instructions embedded in user messages that override the system prompt's intent, tool descriptions that contain embedded commands, or document content the agent reads that contains adversarial instructions. The agent starts calling tools it should not call, skipping steps it should take, or producing outputs that serve the injected instruction rather than the user's stated goal. Span data shows unusual tool call patterns; span data alone does not tell you the cause.

Tool API behavior change. An external API your agent calls changes its response format, authentication requirements, or rate limiting policy. The tool handler starts receiving unexpected responses, raising validation errors, or hitting rate limits it did not hit before. From the agent's perspective, a reliable tool became unreliable. Tool-call error rate rises; task completion rate drops.

Your own prompt regression. A prompt edit that improves one task type silently degrades another. A new tool added to the agent's tool list causes the model to select it in situations where a different tool would produce better results. A system prompt change that fixes a tone issue breaks a reasoning pattern the agent relied on. These regressions are invisible in span data until they affect behavior metrics. By then, users have already seen the degraded output.

Evaluation rubric mismatch. Your eval suite measures the wrong things. The rubric was written when the agent handled a narrow task type; the agent now handles a broader range. The test set does not reflect the current user input distribution. The automated judge model used in LLM-as-judge eval has its own biases that favor certain output styles over correctness. The eval score stays flat while the agent's real-world quality drops, because the metric and the reality have diverged.

Each of these causes has a different signature in span data. A model rollout shows up as a sudden tool-call error rate spike. A prompt regression shows up as a gradual drift in token-per-task or retry rate. A tool API change shows up as an error spike on a specific tool name. Knowing which cause you are looking for shapes which signal you instrument first.

What OTel spans show and what they don't

OpenTelemetry spans measure the mechanics of execution. For an agent, that means: did the tool call succeed, how long did the task take, how many tokens did the model consume, how many retries occurred. These are behavior signals. They tell you how the agent ran.

They do not tell you whether the agent's output was correct.

An agent can complete a task in twelve tool calls, within budget, with zero exceptions, and produce a wrong answer. The spans show all-green. The user receives a confident, well-formatted, factually incorrect response. No span attribute captures this. No metric fires. The task looks successful from every angle that telemetry can observe.

This is the gap that nobody has closed with telemetry alone. The gap exists because OpenTelemetry is a protocol for measuring software execution, not for judging the quality of natural-language output. The GenAI semantic conventions define attributes for token counts, model identity, and operation type. They define nothing for output correctness, task success, or user satisfaction. That territory belongs to evaluation frameworks, not to observability backends.

What this means in practice: span data is a necessary but insufficient condition for knowing whether your agent is healthy. You need span data to catch the behavior regressions that precede quality regressions. You need a separate quality signal to catch the quality regressions that span data misses. The two signals are complementary, not interchangeable.

The sections below cover both. Sections three and five cover what you can measure from spans and how to instrument it. Sections four and six cover the quality signal and how to emit it. Section ten covers what neither approach can do.

The four behavior signals you can measure today

Four signals from OTel spans cover the majority of actionable agent behavior regressions.

1. Tool-call error rate. Count spans representing tool calls where span status is ERROR, divided by total tool call spans, over a rolling window. This is the fastest-moving indicator of degradation. A model rollout that changes function-calling format produces a tool-call error rate spike within minutes. A broken external API produces a spike on a specific tool name. Segment the rate by tool.name to isolate which tool is failing, not just that something is failing.

2. End-to-end latency p95. The 95th percentile duration of the task-level span, from the first model call to the final output. Median latency hides tail behavior. An agent that completes most tasks in eight seconds but occasionally stalls for ninety seconds looks fine at the median. The p95 captures the stall. A rising p95 without a rising error rate often means the agent is entering retry loops that eventually succeed: the task completes, but it took four times as long as it should have.

3. Token spend per task. Sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all model call spans within a single task, grouped by task.id. Token spend per task, not raw volume, is the budget metric. Raw volume increases as you handle more tasks. Spend per task should stay flat as you scale. A rising spend-per-task baseline means the agent is consuming more tokens to complete the same work: the context window is growing with retry history, or the model is calling more tools per task than it used to.

4. Retry rate per task. Track the count of retries that occur within a single task: tool retries where the agent calls the same tool again after a failure, and model retries where the API call itself failed and was retried. Store this as a span attribute on the task-level span (agent.retry_count). A retry count of one is normal. A retry count above three on a single task signals a failure loop. Plot the p90 of retry count per task over time. A rising p90 means more tasks are hitting repeated failures, even if each individual failure resolves before causing an error status on the task span.
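The four signals above reduce to a few small aggregations. The sketch below assumes spans have already been exported and flattened into plain dicts; the dict shape and the "status" key are assumptions made for illustration, not a format urgentry or OpenTelemetry defines.

```python
from collections import defaultdict
from statistics import quantiles


def tool_error_rate_by_name(tool_spans):
    """Signal 1: fraction of ERROR tool-call spans, segmented by tool.name."""
    totals, errors = defaultdict(int), defaultdict(int)
    for span in tool_spans:
        totals[span["tool.name"]] += 1
        if span["status"] == "ERROR":
            errors[span["tool.name"]] += 1
    return {name: errors[name] / totals[name] for name in totals}


def percentile(values, pct):
    """Signals 2 and 4: pct-th percentile (1..99); needs at least two values."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def tokens_per_task(model_spans):
    """Signal 3: input plus output tokens summed per task.id."""
    totals = defaultdict(int)
    for span in model_spans:
        totals[span["task.id"]] += (
            span["gen_ai.usage.input_tokens"] + span["gen_ai.usage.output_tokens"]
        )
    return dict(totals)
```

Call percentile(durations, 95) on task-span durations for the latency signal and percentile(retry_counts, 90) on agent.retry_count values for the retry signal. In production you would more likely express these as queries in your backend; the functions only pin down what each number means.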

The two quality signals you have to send in separately

No OTel span attribute captures whether the agent's output was good. You have to generate that signal outside of normal span instrumentation and send it in as a separate measurement.

Human thumbs-up/down on the conversation. Route a sample of agent conversations to a human reviewer. The reviewer reads the conversation, checks whether the agent's final output met the user's stated goal, and records a binary rating or a score on a rubric. That score goes back into your telemetry as a custom span attribute on the task span (e.g., eval.human_score) or as a tagged metric event keyed to the task.id. Human ratings establish ground truth. They do not scale, and they introduce human latency between the agent running and the quality signal arriving, but they are the only signal that cannot be gamed by a model that has learned to look good on automated rubrics.

Automated eval score from a held-out test set. Write a test suite that covers the task types your agent handles in production. On every deploy, run the agent against the test suite and compute a pass rate. An automated judge, either a rubric-based function or an LLM-as-judge model, grades each output. Emit the pass rate as an OTel metric or a tagged span event so it lands alongside your other telemetry in urgentry. An eval score drop on a deploy is the fastest feedback loop you have for prompt regressions and model update incompatibilities that do not immediately produce tool-call error spikes.

Both signals have limitations. Human ratings are expensive and slow. Automated eval scores depend on the quality of the test set and the quality of the judge. A test set that does not cover the user input distribution will miss real regressions. A judge model with biases will award high scores to outputs that look confident rather than outputs that are correct. These limitations are not reasons to skip quality signals; they are reasons to invest in improving them over time, treating the test set as a living artifact and the judge calibration as an ongoing task.
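A human rating usually arrives after the task span has ended and been exported, so this sketch takes the second route described above: a separate record keyed to task.id. The attribute eval.rubric_version is a hypothetical name added here so ratings made under different rubrics are not averaged together.

```python
def rating_attributes(task_id, thumbs_up, rubric_version):
    """Attributes for one human rating, keyed to the task it judges."""
    return {
        "task.id": task_id,
        "eval.human_score": 1.0 if thumbs_up else 0.0,
        "eval.rubric_version": rubric_version,  # hypothetical attribute name
    }


def approval_rate(ratings):
    """Mean eval.human_score across rating dicts, or None if there are none."""
    scores = [rating["eval.human_score"] for rating in ratings]
    return sum(scores) / len(scores) if scores else None
```

Returning None for an empty window, rather than 0.0, keeps a day with no reviews from looking like a day where every review failed.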

Instrumenting an agent with OTel

The instrumentation pattern uses three span levels: a task span at the top, a model call span for each LLM request, and a tool span for each tool the agent calls.

Install the required packages:

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-http \
    opentelemetry-instrumentation-langchain

Initialize the SDK and instrument LangChain at process start, before any agent or chain objects are created:

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
from opentelemetry.trace import StatusCode

exporter = OTLPSpanExporter(
    endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] + "/v1/traces",
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Instrument LangChain before any agent or chain objects are created.
LangchainInstrumentor().instrument()

tracer = trace.get_tracer("my-agent")

Wrap each agent invocation with a task span that carries the attributes you will query for degradation detection:

def run_agent_task(
    task_id: str,
    task_type: str,
    user_query: str,
    agent_executor,
) -> str:
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("task.id", task_id)
        span.set_attribute("task.type", task_type)
        # Track the model version so you can correlate degradation
        # with model rollouts from the provider.
        span.set_attribute("gen_ai.request.model", os.environ.get("MODEL_NAME", "unknown"))
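        # Placeholder: this sketch does not observe retries itself. Populate
        # the count from your own retry wrapper or callback handler.
        retry_count = 0

        try:
            result = agent_executor.invoke({"input": user_query})
            span.set_attribute("task.success", True)
            return result["output"]
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            span.set_attribute("task.success", False)
            raise
        finally:
            # Runs on both paths, while the span is still open.
            span.set_attribute("agent.retry_count", retry_count)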

The LangChain instrumentor generates child spans automatically for each LLM call and each tool call inside the agent executor. Each LLM call span carries gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. Each tool call span carries the tool name and sets status to ERROR if the tool raises an exception.
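The agent.retry_count attribute is only useful if something actually counts retries. One way to do that, sketched here with two hypothetical helpers (RetryCounter and call_with_retries are not part of LangChain or OpenTelemetry), is to route retryable calls through a wrapper that increments a shared counter. Retries that happen inside the agent executor itself are not visible to this wrapper.

```python
class RetryCounter:
    """Mutable counter so the retry total survives even when the call raises."""

    def __init__(self) -> None:
        self.count = 0


def call_with_retries(fn, counter, max_attempts=3):
    """Call fn() up to max_attempts times, counting each retry on counter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            counter.count += 1  # one more retry is about to happen
```

Create one RetryCounter per task, pass it to every wrapped call, and write counter.count to the task span before the span closes. Because the counter is updated before each retry, the value is correct on the failure path as well as the success path.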

For custom tool functions that LangChain wraps, add explicit error handling to ensure the span status propagates correctly:

from langchain.tools import tool
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("my-agent")

@tool
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base."""
    # The LangChain instrumentor creates a span for this tool call.
    # Get the current span to set additional attributes.
    current_span = trace.get_current_span()
    current_span.set_attribute("tool.name", "search_knowledge_base")
    current_span.set_attribute("tool.query.length", len(query))

    try:
        # _do_search is a placeholder for your own search client call.
        results = _do_search(query)
        current_span.set_attribute("tool.result.count", len(results))
        return "\n".join(r["text"] for r in results)
    except Exception as exc:
        current_span.record_exception(exc)
        current_span.set_status(StatusCode.ERROR, str(exc))
        raise

Set token attributes after the response arrives

Token counts come from the model response object, so they exist only once the model call returns. Set gen_ai.usage.input_tokens and gen_ai.usage.output_tokens at that point, inside the span block and before the span ends. On a timeout or exception there is no response to read, so expect those attributes to be absent on failed model-call spans, and rely on the span status and the recorded exception when debugging them.
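If you create model-call spans by hand rather than through the instrumentor, a small helper keeps the attribute names consistent. This is a sketch under one assumption: the provider's usage data has already been normalized into a dict with input_tokens and output_tokens keys. Providers name these fields differently, so adapt the lookup to your client library.

```python
def record_token_usage(span, usage):
    """Copy token counts from a model response onto an open span.

    span is any object with set_attribute(key, value), such as an
    OpenTelemetry span. usage is None when the call produced no response.
    """
    if usage is None:
        return  # timeout or exception path: there is nothing to record
    span.set_attribute("gen_ai.usage.input_tokens", int(usage.get("input_tokens", 0)))
    span.set_attribute("gen_ai.usage.output_tokens", int(usage.get("output_tokens", 0)))
```

Call it immediately after the model call returns, while the span is still current.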

Adding the eval signal

The eval signal has two parts: running the evaluation and emitting the score as telemetry.

Run the eval suite as part of your CI/CD pipeline on every deploy. A minimal eval suite for degradation detection needs at least twenty to thirty representative task inputs with expected outputs or pass/fail rubrics. The suite exercises the same task types the agent handles in production, including the edge cases that caused problems in the past.

A tool like promptfoo makes it straightforward to define test cases and run them against your agent:

# promptfoo.yaml
prompts:
  - id: agent-system-prompt
    raw: "{{ system_prompt }}"

providers:
  - id: my-agent
    config:
      endpoint: http://localhost:8080/invoke

tests:
  - description: "Summarizes a document correctly"
    vars:
      input: "Summarize the Q3 earnings report"
    assert:
      - type: contains
        value: "revenue"
      - type: llm-rubric
        value: "The summary covers the main financial figures and does not fabricate numbers"

  - description: "Calls the correct tool for a database query"
    vars:
      input: "How many users signed up last week?"
    assert:
      - type: javascript
        value: "output.tool_calls.includes('query_database')"

After the eval run completes, emit the pass rate as an OTel metric so it arrives at urgentry alongside your span data. This script runs in CI after the promptfoo run:

import os
import json
import subprocess
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Set up the metrics exporter pointing at urgentry.
exporter = OTLPMetricExporter(
    endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] + "/v1/metrics",
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=5000)
provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)
meter = metrics.get_meter("eval-runner")

eval_score_gauge = meter.create_gauge(
    "agent.eval.score",
    description="Eval pass rate from the held-out test suite, 0.0 to 1.0",
    unit="1",
)

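def run_eval_and_emit(deploy_id: str, model_name: str) -> float:
    # Run the eval suite (replace with your actual eval runner command).
    # Do not pass check=True: the runner may exit non-zero when test cases
    # fail, and a failing run still needs its score emitted.
    subprocess.run(
        ["promptfoo", "eval", "--output", "results.json"],
        capture_output=True,
        text=True,
    )
    with open("results.json") as f:
        data = json.load(f)

    # Assumption: depending on the promptfoo version, the stats block sits
    # at the top level or under "results". Check your own results.json.
    stats = data["stats"] if "stats" in data else data["results"]["stats"]
    total = stats["successes"] + stats["failures"]
    pass_rate = stats["successes"] / total if total > 0 else 0.0

    # Emit the score with tags that let you correlate with span data.
    eval_score_gauge.set(
        pass_rate,
        attributes={
            "deploy.id": deploy_id,
            "gen_ai.request.model": model_name,
            "eval.suite": "production-regression",
        },
    )
    # A CI process can exit before the periodic reader's next interval, so
    # flush explicitly or the data point may never be exported.
    provider.force_flush()
    return pass_rate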

The gauge arrives in urgentry as a metric you can chart over time. A drop greater than five percentage points from the prior deploy baseline triggers the alert described in section eight.

Routing OTLP to urgentry

urgentry accepts OTLP/HTTP at port 4318, the same port as any standard OTLP receiver. It handles traces, metrics, and logs in the same binary. No Collector sidecar is required.

Set these environment variables before running your agent process or your eval runner:

export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-urgentry-host:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME=my-agent

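Traces arrive at /v1/traces. Metrics arrive at /v1/metrics. Logs arrive at /v1/logs. When an exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment, the SDK appends the signal path automatically, so set the variable to the base URL only, without a path suffix. The Python snippets in this guide pass endpoint= to the exporter explicitly, and the exporter uses an explicit endpoint as-is, which is why they append /v1/traces and /v1/metrics themselves.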
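When you build the per-signal URL yourself, plain string concatenation produces a double slash if the base URL ends in one. A minimal helper, hypothetical and not part of the OpenTelemetry SDK, avoids that:

```python
SIGNAL_PATHS = {
    "traces": "/v1/traces",
    "metrics": "/v1/metrics",
    "logs": "/v1/logs",
}


def signal_endpoint(base_url: str, signal: str) -> str:
    """Join an OTLP/HTTP base URL and a signal path without doubling slashes."""
    if signal not in SIGNAL_PATHS:
        raise ValueError(f"unknown signal: {signal!r}")
    return base_url.rstrip("/") + SIGNAL_PATHS[signal]
```

Pass the result as the exporter's endpoint argument in place of the concatenation.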

In urgentry, agent spans appear under the Traces view. The agent.invoke spans form the root of each trace, with LLM call and tool call child spans visible in the waterfall. Eval score metrics appear under the Metrics view, queryable by deploy.id and gen_ai.request.model. You can view a latency spike and an eval score drop on the same timeline without switching backends.

For local development, urgentry runs as a single binary with no external dependencies:

curl -fsSL https://urgentry.com/install.sh | sh
./urgentry serve --role=all
# Traces: http://localhost:4318/v1/traces
# Metrics: http://localhost:4318/v1/metrics
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=my-agent-dev

The four alerts that catch real degradation

Dashboards show you what happened. Alerts fire before users file tickets. These four alert conditions cover the most common degradation scenarios.

1. Tool-call error rate above baseline plus two standard deviations. Compute your error rate baseline over a rolling seven-day window. Alert when the current five-minute error rate exceeds baseline plus two standard deviations. This adapts to natural variation in your agent's workload rather than firing on a fixed percentage. A model rollout that changes function-calling format produces a spike well above this threshold within minutes of the rollout.

2. p95 latency above 30 seconds. A task that takes more than 30 seconds at p95 is either entering a retry loop or calling a tool with runaway latency. Thirty seconds is a workable starting threshold for most agents; adjust based on your task complexity. Use the p95, not the mean: the mean for agent tasks is dominated by the fast path and hides tail behavior that affects a meaningful fraction of users.

3. Eval-score delta above five percentage points. On every deploy, compare the eval suite pass rate against the prior deploy baseline. Alert when the drop exceeds five percentage points. This fires on prompt regressions and model update incompatibilities that do not immediately affect behavior metrics. It gives you a quality-layer backstop that the behavior alerts cannot provide.

4. Token-per-task above baseline times 1.5. Compute the median token spend per task over a rolling seven-day window. Alert when any single task exceeds 1.5 times that median, or when the p90 over a rolling hour exceeds 1.5 times the baseline p90. A 1.5x multiplier catches retry loops that inflate token spend without necessarily producing task-level errors. An agent stuck retrying a tool three times before succeeding looks healthy on error rate but shows clearly here.
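Written as predicates, the four conditions look like this. The sketch assumes you have already computed the inputs (a list of historical five-minute error rates, the current p95, pass rates, and p90 token spend); the function names are illustrative, and in practice the same logic would live in your alerting rules rather than in application code.

```python
from statistics import mean, stdev


def error_rate_alert(baseline_rates, current_rate, sigmas=2.0):
    """Alert 1: current rate exceeds baseline mean plus N standard deviations.

    baseline_rates needs at least two samples for stdev to be defined.
    """
    threshold = mean(baseline_rates) + sigmas * stdev(baseline_rates)
    return current_rate > threshold


def latency_alert(p95_seconds, limit_seconds=30.0):
    """Alert 2: p95 task latency exceeds the fixed limit."""
    return p95_seconds > limit_seconds


def eval_delta_alert(prior_pass_rate, current_pass_rate, max_drop=0.05):
    """Alert 3: pass rate dropped by more than max_drop (0.05 = five points)."""
    return (prior_pass_rate - current_pass_rate) > max_drop


def token_alert(baseline_p90, current_p90, multiplier=1.5):
    """Alert 4: p90 token spend per task exceeds the baseline times 1.5."""
    return current_p90 > multiplier * baseline_p90
```

Only the first alert adapts to observed variance; the other three use fixed thresholds that you should revisit as the agent's workload changes.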

A worked degradation example

A team runs an agent that answers questions about customer orders by calling three tools: lookup_order, query_fulfillment_status, and send_notification. The agent runs in production, handling two hundred requests per hour.

On a Tuesday afternoon, the model provider ships a checkpoint update under the same model name. The update changes how the model formats function call arguments: previously it emitted order_id as a string; now it emits it as an integer. The lookup_order tool handler validates arguments against a JSON Schema that requires a string. Validation fails on every call.

Within four minutes of the rollout, the tool-call error rate alert fires: lookup_order error rate has jumped from 0.3% to 87%. The span data shows the argument shape mismatch: mcp.tool.args.schema_valid is false on every failing span, and the error message reads order_id: expected string, got integer.

The team checks the eval score. It has not moved. The eval test suite covers five order-lookup scenarios, and all five use integer order IDs in their expected arguments. The schema that the test inputs exercise accepts integers. The eval suite was measuring the wrong thing: it covered the new model behavior but not the production schema. The eval score gives a false green.

This is an evaluation rubric mismatch in practice. The behavior signal caught the regression immediately. The quality signal missed it because the test set did not cover the production schema. The fix required two changes: updating the tool handler to accept integers, and updating the eval test suite to cover the argument type contract. The dashboard tells the story: a sharp spike in lookup_order error rate at 14:23, flat eval scores throughout, recovery at 14:47 after the handler fix deployed.

The lesson: behavior signals and quality signals catch different failure modes. A complete monitoring setup needs both, and the quality signal is only as good as the test set it runs against.
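The argument contract in this example is small enough to write down. The sketch below is a hand-rolled stand-in for the JSON Schema check, not the team's actual handler: validate_lookup_order_args reproduces the failure, and normalize_lookup_order_args shows one way a handler could accept both shapes. Both function names are hypothetical.

```python
# JSON type names for the Python types a decoded argument can have.
_JSON_TYPE_NAMES = {bool: "boolean", int: "integer", float: "number", str: "string"}


def validate_lookup_order_args(args):
    """Return validation errors for lookup_order; an empty list means valid."""
    if "order_id" not in args:
        return ["order_id: missing"]
    value = args["order_id"]
    if not isinstance(value, str):
        got = _JSON_TYPE_NAMES.get(type(value), type(value).__name__)
        return [f"order_id: expected string, got {got}"]
    return []


def normalize_lookup_order_args(args):
    """Coerce an integer order_id to a string so both model formats validate."""
    value = args.get("order_id")
    if isinstance(value, int) and not isinstance(value, bool):
        return {**args, "order_id": str(value)}
    return args
```

An eval case that runs both {"order_id": "1001"} and {"order_id": 1001} through the production validator would have turned the eval score red at the same moment the error-rate alert fired.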

The honest limitations

Telemetry is a backstop, not a substitute for offline evaluation.

Span data tells you the agent's tool calls failed, its latency spiked, its token spend grew. It does not tell you whether any of those changes affected the quality of the agent's output. An agent that enters a retry loop, exhausts its retries, and returns an error message has clearly degraded in behavior; the span data catches that. An agent that completes every task with green spans but produces outputs that are subtly wrong, incomplete, or not useful does not register in span data at all.

You still need a held-out test set. A held-out test set is a collection of task inputs and expected outputs that you do not use for prompt development. You reserve it for evaluation, so that the eval score reflects true generalization rather than performance on examples the prompt was tuned to handle. Without a held-out set, your eval suite measures how well the agent handles the examples you thought of, not how well it handles what users actually send.
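A held-out set only stays held out if the split is stable. One common approach, shown here as a sketch with hypothetical function names, is to assign each case by hashing its identifier, so a case never moves between the development set and the held-out set when you add new cases or reorder the file.

```python
import hashlib


def is_held_out(case_id: str, held_out_fraction: float = 0.3) -> bool:
    """Assign a case to the held-out set from a hash of its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < held_out_fraction


def split_cases(case_ids, held_out_fraction: float = 0.3):
    """Return (development, held_out) lists; membership depends only on the ID."""
    development, held_out = [], []
    for case_id in case_ids:
        target = held_out if is_held_out(case_id, held_out_fraction) else development
        target.append(case_id)
    return development, held_out
```

Tune prompts against the development list only, and report the eval score that feeds the alert from the held-out list.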

You still need humans in the loop sometimes. Automated evals with LLM-as-judge are useful for scaling feedback collection, but they carry the biases of the judge model. A judge model trained to prefer certain output styles may rate confident wrong answers higher than tentative correct ones. Human review of a random sample of production outputs, at whatever frequency your team can sustain, is the only way to ground-truth the judge's scores and catch systematic biases before they mask real quality regressions.
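Ground-truthing the judge can start with two numbers computed over a sample where both a human and the judge graded the same output. The sketch below assumes each pair is (human_pass, judge_pass) as booleans; the function and key names are illustrative.

```python
def judge_agreement(pairs):
    """Compare judge verdicts with human verdicts on the same outputs.

    pairs is an iterable of (human_pass, judge_pass) booleans.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one graded pair")
    agree = sum(1 for human, judge in pairs if human == judge)
    # Judge passed an output the human failed: the bias that hides regressions.
    false_pass = sum(1 for human, judge in pairs if judge and not human)
    return {
        "agreement": agree / len(pairs),
        "judge_false_pass_rate": false_pass / len(pairs),
    }
```

A falling agreement rate, or a rising false-pass rate, is a signal to recalibrate the judge before trusting the next eval score.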

Monitoring is not a substitute for understanding your agent. If you do not know why your agent makes certain decisions, you cannot reliably distinguish expected behavior variation from degradation. Instrumentation gives you data. Interpreting that data still requires understanding what the agent is supposed to do.

Frequently asked questions

Can OpenTelemetry alone tell me whether my agent's answers are getting worse?

No. OTel spans measure behavior: latency, error rate, token spend, retry rate. They do not measure quality: whether the agent's answer was correct, helpful, or safe. You need a separate quality signal, either human ratings or an automated eval suite, and you emit that signal as a custom metric or tagged event alongside your spans.

What is the fastest sign of agent degradation I can observe in spans?

Tool-call error rate is the fastest-moving indicator. When a model rollout changes function-calling format, or when an upstream API changes its contract, tool calls start failing within minutes. The error rate spike in span data typically precedes user complaints by hours. Set an alert on tool-call error rate greater than baseline plus two standard deviations.

How often should I run my eval suite to catch prompt regressions?

Run it on every deploy that changes a prompt, a model version, or a tool schema. For continuous model serving where the model version can change upstream without your deploy, run the eval suite on a schedule: daily at minimum, hourly if the suite is fast enough. Emit the score as a metric so you can chart it alongside your span data and correlate score drops with specific events.

What eval score drop should trigger a rollback?

A delta greater than five percentage points from the prior deploy baseline is a practical threshold for most teams. The right value depends on your task: a medical or legal agent warrants a tighter threshold (two to three points), while a creative writing agent may tolerate wider variance. Set the threshold before you instrument, not after you see the first regression, or the threshold will be anchored to an incident rather than to your risk tolerance.

Does urgentry store OTel metrics alongside traces?

Yes. urgentry accepts OTLP/HTTP at port 4318 for both traces and metrics. Eval scores emitted as OTel metrics arrive at the same endpoint as your spans and appear in urgentry alongside trace data. You can correlate a score drop with a latency spike or an error rate change by timestamp without needing a second backend.

Sources and further reading

  1. OpenTelemetry GenAI semantic conventions — defines gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name. Experimental tier as of May 2026.
  2. OpenTelemetry Protocol (OTLP) specification — transport format, endpoint conventions, and content types for OTLP/HTTP. Covers /v1/traces, /v1/metrics, and /v1/logs signal paths.
  3. opentelemetry-instrumentation-langchain (OpenLLMetry) — the community instrumentation package for LangChain. Installs a callback handler that intercepts chain runs, LLM calls, and tool calls and emits GenAI semconv spans.
  4. promptfoo documentation — open-source eval framework for LLM applications. Supports YAML-defined test cases, LLM-as-judge assertions, and CI/CD integration for running evals on every deploy.
  5. OpenAI Evals — the evaluation framework published by OpenAI covering grading patterns, reference implementations for LLM-as-judge, and guidance on test set construction for regression detection.
  6. Functional Source License 1.1 (FSL-1.1-Apache-2.0) — the license under which urgentry is distributed. Grants use rights; converts to Apache 2.0 after two years.
  7. urgentry compatibility matrix — the published protocol compatibility audit, including OTLP/HTTP ingest coverage at /v1/traces and /v1/metrics, and Sentry SDK compatibility with only the DSN changed.
  8. opentelemetry-python — the Python OpenTelemetry SDK documentation covering TracerProvider, MeterProvider, BatchSpanProcessor, and OTLP exporter configuration.
