Tracking AI agent errors: spans, tool calls, and token costs.
An AI agent that fails silently costs more than one that fails loudly. Token spend climbs, retries compound, and the task either produces low-quality output or never completes. This guide covers how to instrument a LangChain or custom Python agent with OpenTelemetry to capture which tool calls failed, what each task cost in tokens, where latency concentrated, and how to alert before a regression reaches production.
20 seconds. Agent failures produce four signals worth capturing: tool call exceptions, tool call timing, token counts per model call, and retry counts per task. All four land in OpenTelemetry spans. The GenAI semantic conventions define standard attributes for token usage and model identity. A one-line integration package instruments LangChain automatically; custom agents need three to five spans per task loop written by hand.
60 seconds. Point the OTel exporter at urgentry’s OTLP endpoint on port 4318. The same binary that handles your application error tracking receives the agent spans, groups tool call exceptions into issues, and stores token counts as span attributes for dashboard queries. No separate pipeline, no second backend, no Collector sidecar required.
The hardest part of agent observability is the gap between what telemetry can measure and whether the agent did the job well. Token cost and span timing tell you the agent ran and how much it spent. They do not tell you whether the output was correct. This guide covers what telemetry can give you and names the gap it cannot close.
What an agent error looks like
Agent failures come in five shapes, and most of them go undetected without instrumentation.
Malformed model output. The LLM returns JSON that fails to parse against the expected schema. The agent raises a ValidationError or JSONDecodeError, catches it internally, and either retries the call or proceeds with a default. No exception surfaces to the caller. Without a span capturing the failure, you have no record it occurred.
Tool call returning an error status. The agent calls a search API, a database query function, or a code executor. The tool returns an HTTP 500 or raises an exception. The agent may retry automatically. Each retry costs tokens (the error message goes back into context) and adds latency. Three retries on a broken tool can triple the cost of a task.
Retrying a broken plan. The agent produces a plan, executes it, observes a failure, and revises the plan. If the plan revision is incorrect, the agent retries the same broken sequence with minor variations. This produces a long sequence of tool call spans, each failing, with token cost accumulating at each step. Without span data, you see only a slow task and a final failure.
Cost running over budget. A task the agent should complete in five model calls takes twenty. The context window fills with retry history. Each call becomes more expensive because the input token count grows. By the time the task finishes or times out, the cost is three to five times the expected amount.
Silent low-quality output. The agent completes the task, returns a result, and exits with no errors. But the output is wrong, incomplete, or lower quality than expected. No span or metric captures this. Token cost and latency look normal. This is the gap that observability cannot close, and the last section of this guide addresses it.
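To make the cost-overrun shape concrete, here is a minimal arithmetic sketch. It assumes every model call resends the base prompt plus all retry history accumulated so far. The function name and the numbers (a 1,000-token base prompt, 20 tokens of history added per call) are hypothetical, chosen only for illustration.

```python
def total_input_tokens(calls: int, base_prompt: int, growth_per_call: int) -> int:
    """Sum input tokens over a task where each call resends a growing context."""
    return sum(base_prompt + growth_per_call * i for i in range(calls))


# Hypothetical numbers: a 1,000-token base prompt, plus 20 tokens of
# retry history added to the context after every call.
expected = total_input_tokens(calls=5, base_prompt=1000, growth_per_call=20)
runaway = total_input_tokens(calls=20, base_prompt=1000, growth_per_call=20)

print(expected, runaway, round(runaway / expected, 2))
```

The call count grew 4x, but the input token total grew by more than that, because every extra call also pays for the history the earlier calls left behind. Larger per-call growth widens the difference.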
The four signals you want
Before writing any instrumentation code, identify what you want to capture. Four signals cover the majority of actionable agent observability.
1. Errors: tool call failures, parse failures, validation failures. Every exception raised in a tool function, every JSON parse failure on model output, every schema validation rejection. These go into span events via span.record_exception(exc) and set the span status to ERROR. They produce issues in urgentry the same way application exceptions do.
2. Tool call timing. Each tool invocation gets its own span with a start time and end time. Sort by duration to find the tools that add the most latency to the agent loop. A search tool that takes 4 seconds to respond on every call may be the bottleneck even if it never fails. A code executor that occasionally stalls for 30 seconds produces a p99 spike that a median dashboard would hide.
3. Token cost per task. Sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all model call spans for a single task. Multiply by the model’s per-token price. The result is cost per completed task: the budget metric your team can act on. Raw token volume across all tasks is less useful because it conflates task count and task complexity.
4. Retry counts and patterns. Track how many times the agent retried a tool call or a model call within a single task. A retry count of 1 is normal. A retry count above 3 on a single task suggests the agent is stuck in a failure loop. Capture this as a span attribute (agent.retry_count) or as an event on the task-level span.
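As a rough illustration of how the four signals reduce to one row per task, the sketch below collapses a list of span records into per-task totals. The record shape (task_id, status, duration_ms, attrs) is an assumption made for this example; a real export from your backend will be shaped differently.

```python
from collections import defaultdict

# Hypothetical flattened span records. The field names are assumptions,
# not a real export format.
SPANS = [
    {"task_id": "t1", "name": "agent.task", "status": "OK", "duration_ms": 6200,
     "attrs": {"agent.retry_count": 1}},
    {"task_id": "t1", "name": "gen_ai.chat", "status": "OK", "duration_ms": 900,
     "attrs": {"gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 150}},
    {"task_id": "t1", "name": "tool.search", "status": "ERROR", "duration_ms": 4000,
     "attrs": {"tool.name": "search"}},
    {"task_id": "t1", "name": "gen_ai.chat", "status": "OK", "duration_ms": 1100,
     "attrs": {"gen_ai.usage.input_tokens": 1500, "gen_ai.usage.output_tokens": 200}},
]


def summarize_tasks(spans):
    """Collapse spans into one row per task covering the four signals."""
    rows = defaultdict(lambda: {"errors": 0, "tool_ms": 0, "tokens": 0, "retries": 0})
    for span in spans:
        row = rows[span["task_id"]]
        attrs = span["attrs"]
        if span["status"] == "ERROR":
            row["errors"] += 1                     # signal 1: errors
        if span["name"].startswith("tool."):
            row["tool_ms"] += span["duration_ms"]  # signal 2: tool call timing
        # signal 3: token cost (only model call spans carry these attributes)
        row["tokens"] += attrs.get("gen_ai.usage.input_tokens", 0)
        row["tokens"] += attrs.get("gen_ai.usage.output_tokens", 0)
        # signal 4: retries (only the task-level span carries this attribute)
        row["retries"] += attrs.get("agent.retry_count", 0)
    return dict(rows)


print(summarize_tasks(SPANS))
```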
The OTel semantic conventions for GenAI
The OpenTelemetry project maintains a set of GenAI semantic conventions that define standard attribute names for AI model calls. Using these attributes makes your spans queryable with the same keys as any other service instrumented with the same conventions, and they are what urgentry’s GenAI dashboard queries expect.
| Attribute | Type | Description |
|---|---|---|
| gen_ai.system | string | The AI provider, e.g. openai, anthropic, mistral. The spec defines a set of values; use _OTHER for providers it does not list. |
| gen_ai.request.model | string | The model name requested, e.g. gpt-4o, claude-sonnet-4-5. |
| gen_ai.response.model | string | The model that responded, from the response payload. May differ from the requested model when an alias or fallback is used. |
| gen_ai.usage.input_tokens | int | Tokens consumed by the input (prompt). Set after the model responds, on span end. |
| gen_ai.usage.output_tokens | int | Tokens in the model’s response. Set after the model responds, on span end. |
| gen_ai.operation.name | string | The operation type: chat for conversational completions, text_completion for non-chat, embeddings for embedding calls. |
As of May 2026, these attributes are in the experimental tier of the OTel semantic conventions. That means the attribute names are not guaranteed stable across spec versions. In practice, the major instrumentation libraries (opentelemetry-instrumentation-langchain, opentelemetry-instrumentation-openai) already emit these attributes, and urgentry’s ingest and query engine uses them. The attribute names have been stable in the community for over a year despite the formal experimental label. There is ongoing discussion in the OTel GenAI working group about when to promote them; the contested items are around streaming token counts and multi-modal input token measurement, not the core attributes above.
Beyond the GenAI conventions, add your own attributes for task-level context:
- task.id: a unique identifier for the agent task, for grouping all spans from one run
- task.type: the category of task, for filtering and aggregation
- agent.retry_count: how many retries occurred on this span
- tool.name: the name of the function or API the agent called
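A quick way to catch instrumentation drift is to check model call spans against the attribute names and types in the table above. The sketch below is a hypothetical helper, not part of any OpenTelemetry package. It treats gen_ai.response.model as optional, since that value only exists once a response has arrived.

```python
# Attribute names and Python types expected on a model call span,
# taken from the GenAI attribute table in this guide.
REQUIRED_GENAI_ATTRS = {
    "gen_ai.system": str,
    "gen_ai.request.model": str,
    "gen_ai.operation.name": str,
    "gen_ai.usage.input_tokens": int,
    "gen_ai.usage.output_tokens": int,
}


def attribute_problems(attrs: dict) -> list:
    """Return a list of human-readable problems; an empty list means the span looks fine."""
    problems = []
    for key, expected_type in REQUIRED_GENAI_ATTRS.items():
        if key not in attrs:
            problems.append(f"missing {key}")
        elif not isinstance(attrs[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems


print(attribute_problems({"gen_ai.system": "openai", "gen_ai.usage.input_tokens": "12"}))
```

Running a check like this in a test suite, against spans captured from one sample task, flags a renamed or mistyped attribute before it silently breaks a dashboard query.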
Instrument a LangChain agent
LangChain has a community-maintained OpenTelemetry integration. It installs a callback handler that hooks into LangChain’s internal event system and emits spans for chain runs, LLM calls, and tool calls.
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-http \
opentelemetry-instrumentation-langchain
Initialize the OTel SDK and call the instrumentor before your agent runs:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
import os
# Configure the OTLP exporter to point at urgentry.
exporter = OTLPSpanExporter(
    endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] + "/v1/traces",
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Instrument LangChain. This must run before any LangChain imports are used.
LangchainInstrumentor().instrument()
With the instrumentor active, every LangChain agent run produces a span tree. A typical run generates:
- A root span for the AgentExecutor run, covering the full task duration.
- Child spans for each LLM call (ChatOpenAI.invoke or equivalent), carrying gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.
- Child spans for each tool invocation, named after the tool function.
- Exception events on any span where an exception was raised and caught internally by LangChain.
What the automatic instrumentation does not add: your task-level attributes (task.id, task.type), retry counts you track in your own loop, and custom tool attributes that go beyond name and duration. Add those yourself by wrapping the agent call in your own span and setting the attributes on it:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# agent_executor is the LangChain AgentExecutor you built elsewhere.
def run_agent_task(task_id: str, task_type: str, user_query: str) -> str:
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("task.id", task_id)
        span.set_attribute("task.type", task_type)
        try:
            result = agent_executor.invoke({"input": user_query})
            span.set_attribute("task.success", True)
            return result["output"]
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.StatusCode.ERROR, str(exc))
            raise
The agent.task span becomes the parent of all LangChain-generated child spans. Query by task.id in urgentry to see the full span tree for a single task run, including every LLM call and every tool invocation.
Instrument a custom Python orchestration
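If you build your own agent loop with the OpenAI Python SDK or Anthropic SDK directly, nothing instruments the loop itself for you. A package such as opentelemetry-instrumentation-openai can emit spans for the SDK calls, but the task span and the tool spans have no automatic equivalent, and the example here writes all three levels by hand. The pattern is three levels: a task span at the top, a model call span for each LLM request, and a tool span for each function the agent dispatches.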
import json
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import StatusCode
import openai

# OTel setup.
exporter = OTLPSpanExporter(
    endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] + "/v1/traces",
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")
client = openai.OpenAI()

# TOOL_REGISTRY (tool name -> Python callable) and TOOL_SCHEMAS (the matching
# tool definitions sent to the model) are assumed to be defined by your application.

def call_model(messages: list, model: str = "gpt-4o"):
    """Call the model and emit a span with token counts."""
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.operation.name", "chat")
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=TOOL_SCHEMAS,  # without tools, the model never requests a tool call
            )
            usage = response.usage
            span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
            span.set_attribute("gen_ai.response.model", response.model)
            return response
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            raise

def call_tool(tool_name: str, tool_fn, *args, **kwargs):
    """Call a tool function and emit a span. Records exceptions on failure."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        try:
            result = tool_fn(*args, **kwargs)
            return result
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, str(exc))
            raise

def run_task(task_id: str, task_type: str, initial_prompt: str) -> str:
    """Top-level task span. All model calls and tool calls are children."""
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("task.id", task_id)
        span.set_attribute("task.type", task_type)
        messages = [{"role": "user", "content": initial_prompt}]
        retry_count = 0
        max_steps = 10
        for step in range(max_steps):
            response = call_model(messages)
            choice = response.choices[0]
            # If the model is done, return.
            if choice.finish_reason == "stop":
                span.set_attribute("agent.steps", step + 1)
                span.set_attribute("agent.retry_count", retry_count)
                return choice.message.content
            # If the model called a tool, dispatch it.
            if choice.finish_reason == "tool_calls":
                # The assistant message that requested the tools must precede
                # the tool results in the conversation.
                messages.append(choice.message)
                for tool_call in choice.message.tool_calls:
                    tool_name = tool_call.function.name
                    try:
                        # Arguments arrive as a JSON string; parse before dispatch.
                        arguments = json.loads(tool_call.function.arguments)
                        tool_result = call_tool(
                            tool_name,
                            TOOL_REGISTRY[tool_name],
                            **arguments,
                        )
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_call.id,
                            "content": str(tool_result),
                        })
                    except Exception:
                        # Tool failed. Increment retry count and continue.
                        retry_count += 1
                        messages.append({
                            "role": "tool",
                            "tool_call_id": tool_call.id,
                            "content": "Error: tool call failed.",
                        })
        # Max steps reached without completion.
        span.set_status(StatusCode.ERROR, "Max steps reached without completion")
        span.set_attribute("agent.retry_count", retry_count)
        raise RuntimeError(f"Agent did not complete task {task_id} in {max_steps} steps")
This pattern gives you one span per model call with token counts, one span per tool call with exception capture, and one task-level span with retry count and step count. The task span is the parent; urgentry shows the full tree when you query by trace ID or by task.id.
Token counts arrive in the model response. Set gen_ai.usage.input_tokens and gen_ai.usage.output_tokens after you receive the response, inside the span block. Because the counts only exist once a response arrives, they are absent from spans where the call timed out or raised, which is exactly when you most want them. Keep that in mind when a failed task looks cheaper than it was.
Point OTLP at urgentry
urgentry accepts OTLP/HTTP at port 4318, the same port as any standard OTLP/HTTP receiver. Set these environment variables before running your agent:
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-urgentry-host
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME=my-agent
If you configure the exporter programmatically (as in the examples above), the endpoint you pass to the constructor is used as given, which is why the examples append /v1/traces themselves. If you construct the exporter with no endpoint argument, it reads OTEL_EXPORTER_OTLP_ENDPOINT and appends the signal path for you. Both approaches work; the environment variable approach is more portable and works with instrumentation libraries that read it themselves.
For local development, urgentry runs on a laptop with no external dependencies:
curl -fsSL https://urgentry.com/install.sh | sh
./urgentry serve --role=all
# OTLP endpoint is now available at http://localhost:4318
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=my-agent-dev
Verify in 60 seconds
Run a single agent task. In urgentry, open the Traces view. The agent.task span appears within a few seconds. Expand the trace to see the model call child spans and tool call child spans. Check that gen_ai.usage.input_tokens appears on the model call spans. If the trace does not appear, verify that OTEL_EXPORTER_OTLP_ENDPOINT does not include the signal path. /v1/traces gets appended to it, explicitly by the examples above and automatically by the SDK when it reads the variable itself, so setting OTEL_EXPORTER_OTLP_ENDPOINT=https://host/v1/traces results in a post to /v1/traces/v1/traces, which returns 404.
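The doubled-path mistake is easy to catch before any span is sent. The following is a small sketch using only the standard library; the helper names are hypothetical and not part of the OpenTelemetry SDK.

```python
from urllib.parse import urlsplit

# Signal paths that belong on the final URL, not on the base endpoint.
SIGNAL_PATHS = ("/v1/traces", "/v1/metrics", "/v1/logs")


def includes_signal_path(base_endpoint: str) -> bool:
    """True if the base endpoint already ends in an OTLP signal path."""
    path = urlsplit(base_endpoint).path.rstrip("/")
    return path.endswith(SIGNAL_PATHS)


def traces_url(base_endpoint: str) -> str:
    """Build the URL that spans are posted to, given the base endpoint."""
    return base_endpoint.rstrip("/") + "/v1/traces"


for candidate in ("http://localhost:4318", "https://host/v1/traces"):
    print(candidate, includes_signal_path(candidate), traces_url(candidate))
```

Calling includes_signal_path on the configured value at startup and refusing to continue when it returns True turns a silent 404 into an immediate configuration error.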
What to alert on
Dashboards are useful after the fact. Alerts fire before a problem reaches the user.
1. Tool call error rate above threshold. Count spans with name matching tool.* and status ERROR, divided by total tool call spans, over a rolling five-minute window. Alert when this rate exceeds 10%. A single broken tool can account for all errors; use the tool.name attribute to break down the rate by tool.
2. Retry count above 3 on any single task. Alert when any span with name agent.task carries agent.retry_count greater than 3. This fires when a single task enters a failure loop. A task that has retried more than three times is a candidate for manual review or automatic cancellation.
3. Cost per task spike of 2x. Compute the rolling median of total tokens per task (summed across all model call spans in a task trace). Alert when any single task exceeds 2x that median. A task that costs twice the typical amount either has legitimate scope or is stuck in a retry loop that your other alerts may have missed.
4. p99 latency on the agent loop. Alert when the p99 duration of agent.task spans exceeds your expected maximum. Use the 99th percentile, not the average: average latency for agent tasks is dominated by the normal cases and hides the tail. A task that should complete in under 30 seconds but occasionally takes 3 minutes shows clearly in p99 and is invisible in p50.
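The four rules above can be prototyped offline before you translate them into your alerting backend's query language. The sketch below assumes two hypothetical record shapes: tool spans as dicts with tool and error keys, and tasks as dicts with id, tokens, retries, and duration_s. One batch of records stands in for the rolling window, and the thresholds are the ones named above, with 30 seconds used as an example p99 limit.

```python
from statistics import median, quantiles


def tool_error_rates(tool_spans):
    """Error rate per tool, from records like {"tool": "search", "error": True}."""
    totals, errors = {}, {}
    for span in tool_spans:
        totals[span["tool"]] = totals.get(span["tool"], 0) + 1
        errors[span["tool"]] = errors.get(span["tool"], 0) + int(span["error"])
    return {tool: errors[tool] / totals[tool] for tool in totals}


def evaluate_alerts(tool_spans, tasks, max_error_rate=0.10, max_retries=3,
                    cost_factor=2.0, max_p99_s=30.0):
    """Return the names of the alerts that would fire for one window of data."""
    alerts = []
    # Rule 1: tool call error rate, broken down by tool.
    for tool, rate in tool_error_rates(tool_spans).items():
        if rate > max_error_rate:
            alerts.append(f"tool_error_rate:{tool}")
    # Rules 2 and 3: per-task retry loops and cost spikes against the median.
    typical_tokens = median(task["tokens"] for task in tasks)
    for task in tasks:
        if task["retries"] > max_retries:
            alerts.append(f"retry_loop:{task['id']}")
        if task["tokens"] > cost_factor * typical_tokens:
            alerts.append(f"cost_spike:{task['id']}")
    # Rule 4: p99 of task duration (quantiles needs at least two tasks).
    p99 = quantiles([task["duration_s"] for task in tasks], n=100, method="inclusive")[98]
    if p99 > max_p99_s:
        alerts.append("p99_latency")
    return alerts
```

Replaying a day of exported spans through a function like this is a cheap way to tune thresholds: if a rule fires on most windows, nobody will read it.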
Cost-per-task dashboards
Raw token volume is the wrong metric. If you deploy six agent instances and scale to three times the request volume, raw token volume triples. That’s expected. The metric you want is tokens consumed per completed task, which should stay flat as you scale.
To compute it from span data, build a query that:
- Groups spans by task.id (or by trace ID, when only the root agent.task span carries task.id, as in the examples above).
- Sums gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all model call spans in each group.
- Filters to tasks where the root agent.task span has status OK (completed tasks only, not failed or in-flight).
- Computes the median, p90, and p99 of the per-task token sum.
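Those query steps can be sketched in plain Python over exported span records. The record shape here (name, trace_id, status, attrs) is an assumption for illustration. Because the model call spans in the examples above do not carry task.id themselves, the sketch groups by trace ID and uses the root agent.task span only to decide which traces count as completed.

```python
from collections import defaultdict
from statistics import median, quantiles


def per_task_token_stats(spans):
    """Median, p90, and p99 of total tokens per completed task.

    Pass in at least two completed tasks; quantiles() needs them.
    """
    tokens_by_trace = defaultdict(int)
    completed = set()  # trace IDs whose root agent.task span ended with status OK
    for span in spans:
        if span["name"] == "agent.task":
            if span["status"] == "OK":
                completed.add(span["trace_id"])
            continue
        attrs = span["attrs"]
        tokens_by_trace[span["trace_id"]] += attrs.get("gen_ai.usage.input_tokens", 0)
        tokens_by_trace[span["trace_id"]] += attrs.get("gen_ai.usage.output_tokens", 0)

    # Keep only completed tasks, then take percentiles of the per-task sums.
    totals = [tokens_by_trace[trace_id] for trace_id in completed]
    cuts = quantiles(totals, n=100, method="inclusive")  # 99 cut points
    return {"median": median(totals), "p90": cuts[89], "p99": cuts[98]}
```

Failed and in-flight tasks are excluded from the baseline here. Their token spend is still real, so it is worth tracking separately rather than discarding.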
Plot this over time. A baseline forms after a week of production traffic. When the p90 drifts upward, the agent is spending more tokens per task than before. The drift happens for three reasons: the task mix changed (users are asking harder questions), the context window grew (the agent pulls in more data per task), or retry rates increased (more tasks hit failures and retry). The span data tells you which.
For cost in dollars, multiply the per-task token totals by the model’s per-token price. Store the price as a constant in your dashboard configuration. When a model price changes, update the constant and the cost history re-computes correctly.
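A minimal sketch of the dollar conversion follows. The model name and prices are hypothetical placeholders, not real rates. The sketch keeps separate input and output prices because providers commonly charge different rates for the two, which is also why the two token counts are worth keeping as separate attributes.

```python
# Hypothetical prices in USD per million tokens. Replace with your
# provider's published rates; these numbers are placeholders.
PRICES_PER_MILLION = {
    "example-model": {"input": 2.50, "output": 10.00},
}


def task_cost_usd(model: str, input_tokens: int, output_tokens: int,
                  prices: dict = PRICES_PER_MILLION) -> float:
    """Convert one task's token totals into dollars using a price table."""
    rate = prices[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


print(task_cost_usd("example-model", input_tokens=40_000, output_tokens=2_000))
```

Keeping the price table outside the stored span data is what lets history re-compute: the spans hold token counts only, and the dollars are derived at query time.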
Break down cost per task by task.type. An agent that handles both simple and complex task types may show a flat average cost while the complex type has grown 3x. Segmenting by task type exposes regressions that cross-task averages mask.
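The masking effect is easy to reproduce with made-up numbers. In this sketch, the task records and the two "weeks" are hypothetical: the complex type triples in cost while the traffic mix shifts toward simple tasks, so the overall average stays flat.

```python
from collections import defaultdict
from statistics import mean


def mean_tokens_by_type(tasks):
    """Average tokens per task, keyed by task.type."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["type"]].append(task["tokens"])
    return {task_type: mean(values) for task_type, values in groups.items()}


# Hypothetical data: complex tasks go from 3,000 to 9,000 tokens each,
# while the share of simple tasks rises enough to hide it in the average.
week_1 = [{"type": "simple", "tokens": 1000}] * 90 + [{"type": "complex", "tokens": 3000}] * 10
week_2 = [{"type": "simple", "tokens": 1000}] * 78 + [{"type": "complex", "tokens": 9000}] * 2

for week in (week_1, week_2):
    overall = mean(task["tokens"] for task in week)
    print(overall, mean_tokens_by_type(week))
```

Both weeks report the same overall average, and only the per-type breakdown shows the regression.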
The agent-output-quality gap
Token cost and span timing tell you the agent ran and how much it spent. They tell you when tool calls failed and how many times the agent retried. They do not tell you whether the agent’s output was correct.
An agent that completes in twelve tool calls, within budget, with no exceptions, can still produce a wrong answer, an incomplete response, or a plan that ignores a key constraint. The spans show all-green. The user gets a bad result.
This is the gap nobody has closed with telemetry alone. The approaches teams currently use:
- Human review sampling. Route a fraction of agent outputs to a human reviewer who scores them. Feed scores back as custom span attributes (output.quality_score) or as separate metrics. This scales poorly but establishes ground truth.
- LLM-as-judge. Pass the agent’s output to a second model with a rubric. The judge outputs a score or a pass/fail. Store the score as a span attribute on the task span. This scales better but introduces a second model’s failure modes into your quality signal.
- Outcome tracking. For agents with a downstream measurable outcome (a generated email that gets a reply, a code patch that passes tests, a plan that the user accepts), track the outcome as a separate event linked to the task ID. Correlate outcome rates with span data to find patterns.
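For human review sampling, one simple option is to make the sampling decision a deterministic function of the task ID, so the same task is always either in or out of the review set no matter which process asks. This is a sketch of that idea, not a prescribed method; the function name and the 10% rate are assumptions.

```python
import hashlib


def selected_for_review(task_id: str, sample_rate: float) -> bool:
    """Deterministically route roughly sample_rate of tasks to human review."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical task IDs; about 10% of them should be selected.
picked = sum(selected_for_review(f"task-{i}", 0.10) for i in range(10_000))
print(picked)
```

Because the decision depends only on the task ID, a reviewer's score can later be joined back to the task span by that same ID without storing the sampling decision anywhere.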
Telemetry catches the agent failing to run correctly. Catching the agent running correctly but producing wrong output requires a quality measurement layer on top of telemetry. That layer is outside the scope of what OpenTelemetry defines, and there is no consensus yet on what it should look like.
Frequently asked questions
Do I need to change my LangChain agent code to get OpenTelemetry spans?
No application-level changes are required if you use the opentelemetry-instrumentation-langchain package. It installs a callback handler that intercepts LangChain’s internal events and emits spans automatically. You do need to configure the OTel SDK and set the OTLP endpoint before the instrumentation takes effect.
Which OTel semantic conventions cover token usage?
The OpenTelemetry GenAI semantic conventions define gen_ai.usage.input_tokens and gen_ai.usage.output_tokens as standard span attributes. These are in the experimental tier as of May 2026 but are widely adopted across the major instrumentation libraries. The gen_ai.system and gen_ai.request.model attributes are part of the same spec.
How do I compute cost-per-task from raw token spans?
Group spans by a task identifier attribute (e.g. task.id), sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens across all model call spans in that group, then multiply by the per-token price for the model. Filter to completed tasks (root span status OK) to exclude in-flight and failed runs from your baseline.
Will urgentry receive spans from the OpenAI Python SDK directly?
Yes, with one extra package. The opentelemetry-instrumentation-openai package instruments the OpenAI Python SDK and emits spans following the GenAI semconv. Set OTEL_EXPORTER_OTLP_ENDPOINT to your urgentry instance and the spans arrive at /v1/traces with token counts and model attributes attached.
What’s the difference between a tool call error and a model error in agent telemetry?
A tool call error means the function the agent invoked raised an exception or returned a non-success status. A model error means the LLM call itself failed: rate limit hit, context length exceeded, or a network timeout. Both set span status to ERROR and call span.record_exception. What distinguishes them is the span they land on: model call spans carry gen_ai.operation.name (chat, for example), while tool spans have a tool-specific span name (e.g. tool.search) and a tool.name attribute.
Sources and further reading
- OpenTelemetry GenAI semantic conventions: defines gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name. Experimental tier as of May 2026.
- opentelemetry-instrumentation-langchain (OpenLLMetry): the community instrumentation package for LangChain, LlamaIndex, and other frameworks. Emits GenAI semconv spans from LangChain callback events.
- OpenAI Python SDK: the official Python client. Use with opentelemetry-instrumentation-openai to get automatic spans from SDK calls.
- OpenTelemetry Protocol (OTLP) specification: transport format, endpoint conventions, and content types for OTLP/HTTP.
- Functional Source License 1.1 (FSL-1.1-Apache-2.0): the license under which urgentry is distributed. Grants use rights; converts to Apache 2.0 after two years.
- urgentry compatibility audit: the published SDK and protocol compatibility matrix, including OTLP/HTTP ingest coverage and GenAI span support.
One binary. Agent spans, token costs, and error issues together.
urgentry accepts OTLP/HTTP at /v1/traces in the same binary that handles error tracking. Tool call exceptions become issues. Token counts land as span attributes. Point your agent’s OTel exporter at port 4318 and the data is there in seconds.