Updated May 25, 2026

Self-hosted LLM observability: Langfuse, Phoenix, and where errors fit

The LLM observability tools answer one question well: what did the model do, and what did it cost? They answer a second question badly or not at all: what broke in the code around the model call? Those are two layers, and most teams need both. This guide maps the self-hosted options for the first layer and shows where an error tracker covers the second.

TL;DR

20 seconds. Langfuse, Arize Phoenix, and Helicone all self-host and all capture the LLM-specific layer: prompts, completions, token counts, latency per call, eval scores. They are open source and they speak OTLP. None of them is an error tracker. The exceptions in your retrieval step, your tool functions, and your JSON parsing land outside their model, which is where a Sentry-compatible tracker like urgentry picks them up.

60 seconds. Pick one tool for the LLM layer. They overlap so much on tracing and cost that running two is usually wasted effort. Send your GenAI spans there. Then point your application SDK and OTLP exporter at urgentry for everything that is not a model call: the request that 500s before it reaches the LLM, the tool that throws, the rate-limit retry loop that never exits. Share one W3C traceparent across both, and the trace ID on a stack trace in urgentry leads straight to the matching model trace in Langfuse. The whole pair runs on a small VPS in the $5 to $20 a month range.

This guide covers what each LLM observability tool actually does, the gap they leave open, how an error tracker fills it, the wiring between the two layers, and the cost math that pushes teams off managed plans in the first place.

Two layers, one stack

"LLM observability" gets used as if it were one thing. In a running system it is two.

The first layer is everything about the model call: the prompt that went in, the completion that came back, how many tokens each direction burned, which model answered, how long it took, and whether an eval scored the output as good. This is the layer that did not exist three years ago and the layer the new tools were built for.

The second layer is the code wrapped around that call. An LLM app is still an app. It has an HTTP handler that can throw before it ever reaches the model. It has a retrieval step that queries a vector store that can time out. It has tool functions the agent invokes that raise exceptions. It parses the model's JSON output and that parse can fail. None of those are model problems, and none of them show up in a tool built to watch the model.

Teams reach for an LLM observability platform, wire it in, and then discover their pager is quiet while users complain. The model calls all look fine on the dashboard because the failures happened in the layer the dashboard does not watch. A better LLM tool will not help. An error tracker will, doing the job it has always done next to the LLM tool doing the job it was built for.

The self-hosted LLM tools, briefly

Three names come up when developers compare notes on self-hosting this layer. They differ in emphasis more than in core capability.

Langfuse is MIT-licensed and built around prompt management, tracing, and evals. It runs as an OpenTelemetry backend: point an OTLP exporter at /api/public/otel on your instance and traces arrive without a vendor SDK. The OTLP endpoint landed in v3.22.0 and has been the recommended ingest path since. If your priority is versioning prompts and scoring outputs over time, Langfuse leans that way.

Arize Phoenix is open source and built on OpenTelemetry and OpenInference instrumentation. It accepts traces over OTLP, ships as a Docker image, and keeps your data on your own infrastructure. Its strength is trace analysis and experiment comparison: stepping through a retrieval-augmented call to see where time and tokens went. Arize sells a hosted product (AX) on top, but Phoenix itself you run yourself.

Helicone (YC W23) started as a proxy you route model calls through, which makes integration close to one line but puts a hop in your request path. It is open source and self-hostable via Docker or Kubernetes, though its cloud Pro plan starts around $79/month and climbs with log throughput. If a gateway sitting in front of your providers fits your architecture, Helicone fits with it.

Newer entrants keep arriving. A self-hosted LLM observability tool called Torrix surfaced on launch boards in May 2026, per @tianhuil, with builders noting the category has moved past prompt playgrounds toward production telemetry. The shared trait across all of them is OTLP: pick the one whose emphasis matches your work, and the wire format stays portable.

What none of them group: the errors around the call

The failure these tools miss shows up in a plain retrieval-augmented endpoint:

async def answer(question: str) -> str:
    docs = await vector_store.search(question, k=8)   # can time out
    context = "\n".join(d.text for d in docs)         # can raise on a missing or None text
    reply = await llm.chat(prompt(question, context)) # the model call
    data = json.loads(reply.content)                  # can fail to parse
    return data["answer"]                             # can KeyError

Exactly one line in that function is a model call. The LLM observability tool watches that line. The other four lines are where production breaks, and when they break the tool shows a clean trace that simply stops, or no trace at all, because the model was never reached.
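
A minimal runnable sketch of that failure mode, assuming hypothetical stand-ins (FakeVectorStore and FakeLLM are invented for illustration and are not any library's API). The retrieval step times out, so the model is never called and a model-only tracer has nothing to record:

```python
import asyncio


class FakeVectorStore:
    """Hypothetical stand-in for a vector store whose search times out."""

    async def search(self, question: str, k: int = 8) -> list:
        raise asyncio.TimeoutError("vector store did not respond")


class FakeLLM:
    """Hypothetical stand-in for a model client; counts the calls it receives."""

    def __init__(self) -> None:
        self.calls = 0

    async def chat(self, prompt: str) -> str:
        self.calls += 1
        return '{"answer": "ok"}'


async def answer(question: str, store: FakeVectorStore, llm: FakeLLM) -> str:
    docs = await store.search(question, k=8)  # raises before the model call
    context = "\n".join(docs)
    return await llm.chat(f"{context}\n\n{question}")


def run_once():
    """Return the exception type name and how many model calls happened."""
    llm = FakeLLM()
    try:
        asyncio.run(answer("what broke?", FakeVectorStore(), llm))
    except Exception as exc:
        return type(exc).__name__, llm.calls
    return "no error", llm.calls
```

Calling run_once() reports a TimeoutError with zero model calls: the request failed, and the layer that watches the model saw nothing.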

That is error tracking, and it has well-understood mechanics: capture the exception with its stack trace, group it by fingerprint so a thousand identical timeouts collapse into one issue, attach the request context, and alert when a new fingerprint appears or an old one spikes after a deploy. LLM observability tools do not group exceptions this way because grouping exceptions is not what they were built to do. They were built to make the model call legible, and they do that well.
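
To make "group it by fingerprint" concrete, here is a deliberately simplified sketch. It is not the grouping algorithm Sentry or urgentry uses; it only illustrates the idea of hashing the stable parts of a failure (exception type plus the innermost frames) and ignoring the parts that vary (the message). The function names are hypothetical.

```python
import hashlib
import traceback
from collections import Counter


def fingerprint(exc: BaseException, max_frames: int = 3) -> str:
    """Hash the exception type plus its innermost frames (file and function).

    Messages and line numbers are left out on purpose so that small
    variations do not split one failure into many issues.
    """
    frames = traceback.extract_tb(exc.__traceback__)[-max_frames:]
    parts = [type(exc).__name__]
    parts += [f"{frame.filename}:{frame.name}" for frame in frames]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


def search_vector_store(query: str) -> list:
    # Hypothetical failing step; the message differs on every call.
    raise TimeoutError(f"timed out searching for {query!r}")


def group(queries: list) -> Counter:
    """Count captured exceptions per fingerprint."""
    issues: Counter = Counter()
    for query in queries:
        try:
            search_vector_store(query)
        except TimeoutError as exc:
            issues[fingerprint(exc)] += 1
    return issues
```

Run over a thousand different queries, group() yields a single fingerprint with a count of 1000, which is the "one issue" view the paragraph describes.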

The category names give it away. The model-call layer is observability: many signals, dashboards, exploration. The surrounding-code layer is error tracking: discrete failures, grouped and assigned and closed. Our logs vs traces vs errors guide draws the same line in a non-AI context, and it holds here unchanged.

Where urgentry fits

urgentry is a Sentry-compatible error tracker that also ingests OTLP. In an LLM stack it owns the second layer and stays out of the first.

Your application already has a Sentry SDK or could have one in a few lines. That SDK captures the unhandled exceptions in the code around the model call, including the four failure points in the example above, and ships them to urgentry as Sentry envelopes. The same binary also listens for OTLP traces, so the spans your app emits for the retrieval step, the database query, and the HTTP call all arrive at one backend. Errors get grouped into issues; traces give you the timeline around each one.

urgentry does not touch the LLM layer. It will not score a completion, diff two prompt versions, or build a cost-per-conversation report from token counts. If you send it a span carrying gen_ai.usage.input_tokens it stores the attribute, but it does not turn that into the prompt-and-eval workflow Langfuse gives you. That is a real boundary, and it is the same kind of honest gap we describe in self-hosted session replay: urgentry covers one job completely and points you at the right neighbor for the other.

The payoff of keeping urgentry in the picture is that the reliability layer of an AI app behaves like the reliability layer of any other app. The on-call engineer opens the same issue list, reads the same stack traces, and gets the same alert when a deploy introduces a new exception. The fact that there is an LLM somewhere downstream does not change how you triage a 500.

Wiring the two layers together

The two layers connect through OTLP and a shared trace ID. You have two clean ways to set it up.

The first is to fan out at an OpenTelemetry Collector. Your app exports once; the Collector duplicates the stream and sends a copy to each backend.

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/langfuse:
    endpoint: https://langfuse.internal/api/public/otel
    headers:
      Authorization: "Basic ${LANGFUSE_AUTH}"
  otlphttp/urgentry:
    endpoint: https://errors.yourdomain.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/langfuse, otlphttp/urgentry]
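
The LANGFUSE_AUTH value in that config is a Basic auth credential. Langfuse's OpenTelemetry docs describe it as the base64 encoding of the project's public key and secret key joined by a colon. A small sketch for producing it; the helper name is hypothetical and the keys in the usage note are placeholders, not real credentials:

```python
import base64


def langfuse_basic_auth(public_key: str, secret_key: str) -> str:
    """Return base64("public_key:secret_key") for the Authorization header."""
    return base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
```

For example, langfuse_basic_auth("pk-lf-example", "sk-lf-example") gives the string to export as LANGFUSE_AUTH before starting the Collector.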

The second is to skip the duplication and split by concern: let the GenAI instrumentation send model spans to Langfuse, and let the Sentry SDK send exceptions to urgentry. Either way, the link between them is the W3C traceparent. Because both backends record the same trace ID, an exception you are reading in urgentry carries the ID you paste into Langfuse to see what the model was doing when the surrounding code fell over.
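
For reference, the traceparent header has four dash-separated hex fields: version, trace ID, parent span ID, and flags. The sketch below pulls the trace ID out of a version-00 header using only the standard library. The function name is hypothetical, and in a real app the OpenTelemetry propagators do this for you:

```python
import re
from typing import Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: str) -> Optional[str]:
    """Return the 32-character trace ID, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # The spec treats an all-zero trace-id or parent-id as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id
```

Given the example header from the W3C spec, 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01, this returns 4bf92f3577b34da6a3ce929d0e0e4736, the value both backends would store.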

On the error side, the capture is ordinary Sentry SDK code with the trace context attached:

import json

import sentry_sdk
from opentelemetry import trace

try:
    data = json.loads(reply.content)
except json.JSONDecodeError as exc:
    span = trace.get_current_span()
    trace_id = format(span.get_span_context().trace_id, "032x")
    sentry_sdk.set_tag("trace_id", trace_id)   # the join key into Langfuse
    sentry_sdk.capture_exception(exc)
    raise

That tag is the whole trick. The error tracker groups and alerts; the LLM tool holds the prompt and tokens; the trace ID stitches a single failure across both views. If you have already read our tracking AI agent errors guide, this is the same instrumentation, now pointed at two backends on purpose instead of one.
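
One edge case is worth guarding. When no span is active, OpenTelemetry's get_current_span() returns a non-recording span whose trace ID is 0, which would format as thirty-two zeros and join to nothing. A minimal sketch of a guard, with a hypothetical helper name, that you could call before setting the tag:

```python
from typing import Optional


def trace_id_tag(trace_id: int) -> Optional[str]:
    """Format a trace ID as 32 lowercase hex chars, or None if there is no trace."""
    if trace_id == 0:  # no active span, so there is nothing to join on
        return None
    return format(trace_id, "032x")
```

Skipping the tag when this returns None keeps all-zero IDs out of your issue list.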

The cost math that starts the conversation

Most teams do not arrive here on principle. They arrive on the bill.

One developer summed up the move: Helicone wanted around $200/month at their LLM volume, Arize quoted roughly $400, and self-hosted Langfuse on a $20/month Hetzner box gave them the same trace depth and cost dashboard with full SQL access to their own data, three months in with zero downtime (per @hndx74). The pattern shows up across observability generally; as @OneManSaas put it, the bills are mostly a function of which framework you picked, not how much you logged.

The honest version of the math has two columns. Managed plans charge by token volume or log throughput, so the bill grows with traffic and you do not touch infrastructure. Self-hosting flattens the per-event cost to roughly zero after a fixed monthly box, and you pay instead in the hours you spend running it. For LLM observability specifically the box stays small, because trace volume is bounded by how many model calls you make, not by a raw log firehose. A single VPS in the $5 to $20 range absorbs the LLM tool and an urgentry instance together at the volumes most teams actually run.
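
The two columns reduce to simple arithmetic. The sketch below uses the $200 managed quote and $20 box from the anecdote above. The $90 hourly rate is purely an assumption for illustration, so substitute your own:

```python
def self_hosted_monthly(box_usd: float, ops_hours: float, hourly_rate_usd: float) -> float:
    """Fixed box cost plus the operator time spent running it."""
    return box_usd + ops_hours * hourly_rate_usd


def break_even_hours(managed_usd: float, box_usd: float, hourly_rate_usd: float) -> float:
    """Monthly ops hours at which self-hosting costs the same as the managed plan."""
    return (managed_usd - box_usd) / hourly_rate_usd
```

With those numbers, break_even_hours(200, 20, 90) is 2.0: self-hosting comes out ahead as long as the stack needs less than about two hours of attention a month at that assumed rate.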

The catch is the same one self-hosting always carries: the bursty maintenance nobody quotes. Self-hosted stacks that ship breaking migrations turn upgrades into events rather than background noise, a pain that operators of self-hosted Sentry know well, as @m13v_ noted. The two ways to keep that cost low are to pick tools that take upgrades seriously and to keep the moving-part count down. The first is why we wrote zero-downtime upgrades; the second is why urgentry ships as one binary instead of a container fleet.

Choosing: one tool, two, or the whole thing

The decision is smaller than the tool count suggests.

If you are debugging prompts and comparing model outputs and your app almost never throws, one LLM observability tool is enough. Pick Langfuse for prompt and eval work, Phoenix for trace analysis, Helicone if a gateway suits your setup, and stop there.

If you are running an LLM feature in production where users notice when it breaks, you want both layers: the LLM tool for the model call and an error tracker for the code around it. Skipping the second layer is how teams end up with a green dashboard and an angry inbox.

And if you are already running urgentry for your non-AI services, the AI feature does not need a second error backend. It needs the LLM layer added beside the tracker you already operate, sharing a trace ID so the two views line up. The error tracking you have keeps working; you bolt on the model-call visibility and wire the trace context through.

Frequently asked questions

Is urgentry an LLM observability tool like Langfuse?

No. urgentry is a Sentry-compatible error tracker that also ingests OTLP traces. It captures exceptions and latency in the code around your model calls, groups them into issues, and alerts on regressions. It does not score completions, version prompts, or run evals. For that layer you run Langfuse or Phoenix alongside it.

Can Langfuse and urgentry receive the same OTLP traces?

Yes. Both speak OTLP over HTTP. You can fan a trace out to both from one OpenTelemetry Collector, or route LLM spans to Langfuse and application errors to urgentry. As long as both ends share the same W3C traceparent, an exception in urgentry links back to the matching model trace in Langfuse by trace ID.

What does self-hosting these tools actually cost?

The software is free. Langfuse and Phoenix are open source, and Helicone publishes a self-host build too. The cost is the box plus the hours you spend running it. One developer reported moving to self-hosted Langfuse on a $20/month Hetzner instance after being quoted around $200/month on Helicone cloud and $400 on Arize at their token volume.

Do I need all three of Langfuse, Phoenix, and Helicone?

No. Pick one for the LLM layer; they overlap heavily on tracing and cost dashboards. Langfuse leans toward prompt management and evals, Phoenix toward trace analysis and experiments, Helicone toward gateway-style proxying. Then add an error tracker for the application failures none of them are built to group.

Which OpenTelemetry attributes carry token usage?

The GenAI semantic conventions define gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on spans, plus gen_ai.request.model for the model name. They are still marked experimental but are widely adopted, and they are what the LLM observability tools read to build their cost dashboards.
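
As a rough illustration of how those attributes turn into a cost figure, here is a sketch that prices a single span from its attribute dictionary. The model name and per-million-token prices are hypothetical placeholders, and this is not how any particular tool computes its dashboards:

```python
# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICE_PER_MILLION = {"example-model": {"input": 3.00, "output": 15.00}}


def span_cost_usd(attributes: dict) -> float:
    """Estimate the cost of one model-call span from its GenAI attributes."""
    price = PRICE_PER_MILLION[attributes["gen_ai.request.model"]]
    input_tokens = attributes.get("gen_ai.usage.input_tokens", 0)
    output_tokens = attributes.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```

Summing span_cost_usd over the spans of a trace gives a per-request estimate under those assumed prices.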

Sources

Langfuse self-hosting and OpenTelemetry docs — confirms the MIT license, the OTLP backend endpoint at /api/public/otel, and the v3.22.0 ingest path.
Arize Phoenix documentation — Phoenix as an open-source, OpenTelemetry-native tracing tool that self-hosts via Docker and keeps data on your own infrastructure.
Helicone pricing and the Helicone source repository — the open-source self-host build, the proxy model, and the cloud plan tiers.
OpenTelemetry GenAI span semantic conventions — the canonical reference for gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.request.model.
Inside the LLM Call: GenAI Observability with OpenTelemetry — OpenTelemetry's own 2026 walkthrough of how model calls map to spans.
@hndx74 on X — the lived cost comparison: Helicone ~$200/month, Arize ~$400, self-hosted Langfuse on a $20 Hetzner box.

The errors layer for your LLM app, on one binary.

urgentry takes the Sentry SDK and OTLP traces from the code around your model calls and groups them into issues you can triage. Run it next to Langfuse or Phoenix on the same small VPS. SQLite by default, Postgres optional, 218 Sentry API operations covered.
