
OpenTelemetry for LLM tracing: a guide to instrumenting agents and routing spans anywhere

2 July 2026 · Braintrust Team · 12 min
TL;DR

OpenTelemetry gives LLM and agent teams a vendor-neutral way to generate, collect, and export traces. With the GenAI semantic conventions, the same instrumentation can capture model calls, token usage, retrieval steps, tool calls, and agent execution in a schema multiple backends can read.

Standard APM tools can show latency, errors, and request health, but they cannot judge whether an LLM output was correct, grounded, or safe for users. Braintrust adds the evaluation layer by scoring the same OpenTelemetry spans against the quality criteria your team defines.

This guide explains the OpenTelemetry concepts used in LLM tracing, how to instrument an LLM or agent application once, how to route GenAI spans to your existing observability tools, and how Braintrust turns those traces into output-quality checks you can use in production.


Why standard APM misses LLM application behavior

Traditional monitoring stacks were built around requests, queries, response codes, latency, and error rates. Datadog, Grafana, and Honeycomb can indicate whether a service is slow, failing, or experiencing unusual traffic, and these signals remain useful for LLM applications. But the same signals stay green when a model call completes normally yet returns an answer that is incorrect, off-policy, or unsupported by the retrieved context. An APM trace can mark the request as healthy because the response was fast and successful, even though the user received a bad answer.

Figure: Standard APM can mark an LLM request as healthy based on latency, status, and errors, while an LLM-aware quality view shows whether the answer was correct, grounded, and review-ready.

The black box inside an instrumented service

The information needed to understand an LLM failure usually sits inside the application flow, deeper than any HTTP status code records. A single feature may build a prompt, retrieve context from a vector store, call a model, parse the output, invoke a tool, and loop through another step before returning a response. Standard instrumentation often collapses those operations into a single outbound HTTP span that records only duration and status. Engineers can see that the call happened, but not which prompt was sent, which documents were retrieved, which tokens came back, or which tool the model chose.

OpenTelemetry keeps the existing monitoring stack in place while adding structure at the LLM layer. Prompts, retrieval steps, tool calls, model responses, and token counts can travel as structured telemetry that any compatible backend can inspect. For how tracing fits the broader monitoring and evaluation picture, see our LLM observability guide.

OpenTelemetry concepts for LLM tracing

OpenTelemetry is a vendor-neutral observability framework for generating, collecting, and exporting telemetry. In LLM applications, spans, traces, and semantic conventions determine how much of the request path your telemetry can explain, from the network call through retrieval, model calls, and tool steps.

Spans and traces

A span represents one unit of work with a start time, an end time, attributes, and a status. A trace connects related spans under one trace ID, so you can follow a user request as it moves through the application. For an LLM feature, the root span can represent the full request, while child spans can represent retrieval, model calls, tool invocations, parsing, and other steps in the application flow. Reading the trace tree shows the path the request took and where time was spent across each step.
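To make the parent and child relationship concrete, here is a minimal sketch in plain Python that rebuilds a trace tree from flat span records. The record shape (`span_id`, `parent_id`, `name`, `duration_ms`) and every value in it are illustrative assumptions, not the OTLP wire format or any backend's storage schema.

```python
from collections import defaultdict

# Hypothetical span records for one trace; every name and duration is made up.
spans = [
    {"span_id": "a1", "parent_id": None, "name": "handle_request", "duration_ms": 1200},
    {"span_id": "b2", "parent_id": "a1", "name": "retrieve_context", "duration_ms": 180},
    {"span_id": "c3", "parent_id": "b2", "name": "embed_query", "duration_ms": 40},
    {"span_id": "d4", "parent_id": "a1", "name": "chat example-model", "duration_ms": 900},
]


def render_tree(spans):
    """Return indented lines that show each span beneath its parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit children in the order they were recorded.
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span is the one without a parent
    return lines


tree = render_tree(spans)
print("\n".join(tree))
```

Read top to bottom, the indented output is the same view a trace UI gives you: the root request, the steps nested inside it, and the time each step took.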

GenAI semantic conventions

LLM spans become portable when backends agree on the fields each span should contain. OpenTelemetry's GenAI semantic conventions define a shared vocabulary for model calls, token usage, tool calls, and agent steps. Attributes such as gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name give compatible tools a consistent way to read the same model interaction. Datadog maps OpenTelemetry GenAI semantic conventions v1.37+ into its LLM Observability span schema, so GenAI spans can be analyzed without adding a separate Datadog SDK.
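As a rough illustration of what that shared vocabulary buys you, the sketch below reads the four attributes named above from a span's attribute map. The attribute keys come from the GenAI conventions; the model name, token counts, and the `summarize_usage` helper are hypothetical.

```python
# Attributes on a hypothetical model-call span, keyed by GenAI convention names.
model_call_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model-large",  # placeholder model name
    "gen_ai.usage.input_tokens": 1250,
    "gen_ai.usage.output_tokens": 310,
}


def summarize_usage(attributes):
    """Read the shared attribute names without knowing which SDK emitted them."""
    input_tokens = attributes.get("gen_ai.usage.input_tokens", 0)
    output_tokens = attributes.get("gen_ai.usage.output_tokens", 0)
    return {
        "operation": attributes.get("gen_ai.operation.name"),
        "model": attributes.get("gen_ai.request.model"),
        "total_tokens": input_tokens + output_tokens,
    }


summary = summarize_usage(model_call_attributes)
print(summary)
```

Because the keys are standardized, the same reader works on spans from any instrumentation library that follows the conventions.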

How a vendor-neutral standard lets you swap backends freely

A shared schema separates instrumentation from the backend that receives the spans. Teams can start with one destination, add another, or replace a backend without rewriting the code that emits telemetry. Braintrust implements the OpenTelemetry GenAI semantic conventions, so spans with gen_ai.* attributes map automatically to structured inputs, outputs, metadata, and token metrics when they arrive. For the full mapping of attributes to Braintrust fields, see the OpenTelemetry integration reference.

How to instrument an LLM or agent app with OpenTelemetry

LLM tracing usually combines automatic coverage for common SDK calls with manual spans for application-specific steps.

Auto-instrumentation: Instrumentation libraries can wrap common SDKs and frameworks so calls to providers such as OpenAI and Anthropic, and frameworks such as LangGraph, emit GenAI spans automatically. You add the library at startup, and the instrumentation records model calls as they happen. This works well for existing codebases because coverage starts at the provider and framework layer without changing the application path.

Manual spans: Manual spans cover the steps that automatic instrumentation does not see, such as retrieval, custom tools, routing logic, or post-processing. You open a span around the application step, set the relevant attributes, and close it when the step completes. Pairing automatic provider instrumentation with manual spans gives the trace enough structure to show the full pipeline around the model call.

The span processor and exporter model: After a span is created, it flows through the OpenTelemetry pipeline before it leaves the process. A tracer provider creates spans, a span processor batches them, and an exporter sends them to a destination over the OpenTelemetry Protocol (OTLP). Because span creation and span export are separate, teams can change where spans go without rewriting the code that emits them.
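The separation between creating spans and exporting them can be sketched with a toy model. The `BatchingProcessor` class below is a hypothetical stand-in written for illustration, not the OpenTelemetry SDK's batch processor. A real SDK processor also flushes on a timer and exports from a background thread, whereas this sketch exports inline to stay short.

```python
class BatchingProcessor:
    """Toy stand-in for a batching span processor (not the OpenTelemetry SDK class)."""

    def __init__(self, export, max_batch_size=3):
        self._export = export  # any callable that accepts a list of spans
        self._max_batch_size = max_batch_size
        self._buffer = []

    def on_end(self, span):
        """Buffer a finished span and export once the batch is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        """Export whatever is buffered, for example at shutdown."""
        if self._buffer:
            self._export(list(self._buffer))
            self._buffer.clear()


# The "exporter" here just records batches; swapping it changes the destination
# without touching the code that produces spans.
exported_batches = []
processor = BatchingProcessor(export=exported_batches.append, max_batch_size=2)

for name in ["retrieve_context", "chat", "parse_output", "execute_tool", "chat"]:
    processor.on_end({"name": name})
processor.flush()

print([len(batch) for batch in exported_batches])
```

The code that finishes spans only ever calls `on_end`. Which destination receives the batches is decided by the callable passed in at startup, which is the property that makes backends swappable.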

Pointing a plain OTLP exporter at a backend can take as little as two environment variables. The following example routes any OTLP exporter to Braintrust by setting the endpoint and authentication headers.

bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.braintrust.dev/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <Your API Key>, x-bt-parent=project_id:<Your Project ID>"

For a full, runnable walkthrough with framework code, see the OpenTelemetry logging recipe.
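If it helps to see how the headers variable above decomposes, here is a small sketch that splits a comma-separated `key=value` string into a header map. The `parse_otlp_headers` helper is hypothetical and simplified; real exporters do their own parsing and may also URL-decode values.

```python
def parse_otlp_headers(raw):
    """Split a comma-separated key=value list into a dict of HTTP headers."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate trailing commas
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = pair.partition("=")
        headers[key.strip()] = value.strip()
    return headers


raw = "Authorization=Bearer <Your API Key>, x-bt-parent=project_id:<Your Project ID>"
headers = parse_otlp_headers(raw)
print(headers)
```

The result is two headers: `Authorization` carries the API key, and `x-bt-parent` names the project that should receive the spans.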

Routing LLM spans to your existing stack

A shared GenAI schema lets the same span stream reach your observability tools and an eval backend through one OpenTelemetry pipeline. The application emits spans once, and the routing layer decides which backends receive them.

The multi-exporter pattern

Teams usually route OpenTelemetry spans through the OpenTelemetry Collector when more than one backend needs the same data. The Collector receives spans from the application, applies processing such as batching, sampling, redaction, or enrichment, and forwards the processed spans to one or more exporters. With the Collector in place, adding Datadog, Grafana, Honeycomb, Braintrust, or another destination becomes a configuration change that ships without redeploying the application.
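A Collector configuration for this pattern has roughly the shape sketched below, written as a Python dictionary to keep this guide's examples in one language. Collector configs are normally written in YAML, and this is an assumption-laden outline rather than a tested deployment. The `otlphttp/apm` exporter and its `apm.example.com` endpoint are placeholders, and `undefined_exporters` is a hypothetical sanity check.

```python
import json

# Outline of a Collector config with one receiver and two destinations.
collector_config = {
    "receivers": {"otlp": {"protocols": {"http": {}, "grpc": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "otlphttp/braintrust": {
            "endpoint": "https://api.braintrust.dev/otel",
            "headers": {
                "Authorization": "Bearer <Your API Key>",
                "x-bt-parent": "project_id:<Your Project ID>",
            },
        },
        # Placeholder for a second backend; the endpoint is made up.
        "otlphttp/apm": {"endpoint": "https://apm.example.com/otlp"},
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlphttp/braintrust", "otlphttp/apm"],
            }
        }
    },
}


def undefined_exporters(config):
    """List exporters the traces pipeline uses but the config never defines."""
    defined = set(config["exporters"])
    used = config["service"]["pipelines"]["traces"]["exporters"]
    return [name for name in used if name not in defined]


missing = undefined_exporters(collector_config)
print(json.dumps(collector_config["service"], indent=2))
```

Adding a destination means adding one entry under `exporters` and one name in the pipeline's exporter list. The application that emits the spans is not involved.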

Routing a LangGraph agent to Datadog or Grafana

LangGraph spans emitted through OpenTelemetry are standard OTLP spans, so teams can send them to Datadog or Grafana using the same export path they use for other services. The same spans can also be sent to Braintrust for evaluation, which keeps observability and output-quality scoring connected to one instrumentation path. The agent code stays focused on application behavior, while exporters determine where each span is inspected.

Distributed tracing for agent frameworks and multi-model pipelines

Agent workflows often include several model calls, retrieval steps, tool invocations, and routing decisions before a response reaches the user. Distributed tracing keeps those steps connected under one request-level trace, so teams can inspect the full execution path as a single connected run.

Tracing every hop in an agent run: A LangGraph agent moves through nodes, makes decisions, calls tools, and may call a model several times before returning a final response. OpenTelemetry captures those steps as child spans under the run, which lets the trace show graph execution, node transitions, tool calls, and model calls in the order they happened.

Braintrust's auto-instrumentation registers a global LangChain callback handler that LangGraph also uses, so teams can capture the agent structure without writing manual span code for every step. When the trace reaches Braintrust, engineers can see which node was slow, which tool call failed, and where the run changed direction.

Measuring latency across multiple models: Some pipelines call more than one provider in a single request, such as one model for drafting and another model for review. Because each model call can carry attributes such as gen_ai.request.model and provider metadata, one trace can show span-level latency and token usage across providers.

Trace context keeps those model calls connected even when they run across different services. Braintrust supports cross-service propagation in both directions, so a Braintrust-instrumented service and an OpenTelemetry-instrumented service can share one trace across an HTTP boundary.
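OpenTelemetry's default propagation format is the W3C Trace Context `traceparent` header, and the sketch below shows how two services could share a trace ID through it. This is a simplified illustration using only the standard library. In practice the OpenTelemetry propagators build and parse this header for you, and the helper names here are hypothetical.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id, span_id, sampled=True):
    """Format the header an upstream service attaches to its outgoing request."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the trace context from a header, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header)
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A: starts the trace and calls service B.
trace_id = secrets.token_hex(16)      # 32 hex characters
drafting_span = secrets.token_hex(8)  # 16 hex characters
header = build_traceparent(trace_id, drafting_span)

# Service B: continues the same trace with a new span of its own.
context = parse_traceparent(header)
review_span = {
    "trace_id": context["trace_id"],
    "parent_id": context["parent_span_id"],
    "span_id": secrets.token_hex(8),
}
print(header)
```

Because service B reuses the incoming trace ID and records the caller's span as its parent, both model calls appear in one trace even though they ran in different processes.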

Managing tracing overhead: Tracing adds processing work, but a well-configured OpenTelemetry pipeline keeps export work outside the live request path. Exporters batch spans and send them asynchronously, so network export does not block the model call or agent step.

High-volume systems can use sampling to control trace volume. Tail-based sampling at the Collector waits until a trace finishes before deciding whether to keep it, which allows teams to retain error traces while sampling successful runs. That setup preserves visibility into failure cases while keeping routine trace volume manageable.
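The decision logic behind that setup can be sketched in a few lines. This is a simplified illustration of the idea, not the Collector's tail sampling processor. The span record shape, the `"ERROR"` status value, and the `keep_trace` helper are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id, spans, success_sample_rate=0.1):
    """Decide after a trace has finished whether to keep it.

    Traces containing an error span are always kept. Successful traces are
    kept at roughly `success_sample_rate`, chosen deterministically from the
    trace ID so the same trace always gets the same decision.
    """
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < success_sample_rate


failed_run = [{"name": "chat", "status": "OK"}, {"name": "execute_tool", "status": "ERROR"}]
clean_run = [{"name": "chat", "status": "OK"}]

print(keep_trace("trace-001", failed_run, success_sample_rate=0.0))
print(keep_trace("trace-002", clean_run, success_sample_rate=0.0))
```

The decision can only be made once every span in the trace has arrived, which is why this kind of sampling lives in the Collector rather than in the application process.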

Scoring output quality with an OpenTelemetry-native eval backend

OpenTelemetry shows how an LLM application executed, but output quality needs a separate measurement layer. A trace can show that a model call returned a 200 in 900 milliseconds, while the answer may still be unfaithful to the source, based on the wrong tool decision, or unsuitable for review. APM tools can keep measuring system health, while Braintrust evaluates the same production traces against the quality criteria your team defines.

Scoring the same spans you already export

Braintrust reads the same gen_ai.* spans your application already emits and runs scoring functions against the captured inputs, outputs, metadata, and token metrics. Because Braintrust implements the GenAI semantic conventions, the fields needed for evaluation already travel through the OpenTelemetry pipeline. Teams can use heuristic checks, model-graded judgments, or custom scorers to measure live production traffic without adding a second instrumentation path.
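As one example of a heuristic check, the sketch below scores how much of an answer is supported by the retrieved context using plain word overlap. It is a deliberately crude stand-in to show the shape of a scorer, not Braintrust's scorer API. The function name, the threshold, and the sample texts are all hypothetical.

```python
import re


def _words(text):
    """Lowercase a text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(output, context):
    """Fraction of distinct output words that also appear in the context."""
    output_words = _words(output)
    if not output_words:
        return 0.0
    return len(output_words & _words(context)) / len(output_words)


context = "Refunds take 5 business days to process."
grounded = grounding_score("Refunds take 5 days.", context)
ungrounded = grounding_score("Refunds are instant.", context)

THRESHOLD = 0.5  # hypothetical cut-off for flagging an output for review
print(grounded >= THRESHOLD, ungrounded >= THRESHOLD)
```

A word-overlap score is easy to fool, so a production setup would likely pair checks like this with model-graded judgments. What matters here is the interface: a scorer takes fields already captured on the span and returns a number.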

Turning failures into regression tests

When a scorer flags a bad output, the trace behind the failure can become an eval case. Teams can capture the failing input, define the expected behavior, add the case to a dataset, and run future prompt or agent changes against it before release. Production failures then become a reusable evaluation dataset, which helps prevent the same issue from returning in a later version.
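The workflow above can be sketched as two small helpers: one that turns a flagged trace into an eval case, and one that adds the case to a dataset without duplicating inputs. The record shapes and function names are hypothetical and do not represent the Braintrust dataset API.

```python
def trace_to_eval_case(trace, expected):
    """Build an eval case from a flagged trace and a reviewer-supplied expectation."""
    return {
        "input": trace["input"],
        "expected": expected,
        "metadata": {
            "source_trace_id": trace["trace_id"],
            "failed_output": trace["output"],  # kept for context, not used for scoring
        },
    }


def add_case(dataset, case):
    """Append the case unless the dataset already covers the same input."""
    if any(existing["input"] == case["input"] for existing in dataset):
        return False
    dataset.append(case)
    return True


# A hypothetical trace that a scorer flagged in production.
flagged_trace = {
    "trace_id": "trace-123",
    "input": "How long do refunds take?",
    "output": "Refunds are instant.",
}

regression_dataset = []
case = trace_to_eval_case(flagged_trace, expected="Refunds take 5 business days.")
added_first = add_case(regression_dataset, case)
added_again = add_case(regression_dataset, case)
print(len(regression_dataset))
```

Keeping the source trace ID in the metadata lets a reviewer jump from a failing eval case back to the production trace that produced it.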

One backend among several

Braintrust works alongside APM tools in an OpenTelemetry setup because each backend answers a different operational question. Datadog, Grafana, and Honeycomb show latency, errors, dashboards, and trace search for system behavior. Braintrust scores whether the model output was correct, grounded, and aligned with the criteria your team set. Exporting the same GenAI spans to both keeps observability and evaluation connected to one source of trace data. To compare options in this category, see our LLM tracing tools roundup, or start free with Braintrust to score your own traces.


FAQs: OpenTelemetry for LLM tracing

Can I send OpenTelemetry spans to two backends at once?

OpenTelemetry can send the same spans to two or more backends at once. Teams can attach multiple exporters to the trace pipeline, but the cleaner production setup is usually an OpenTelemetry Collector that receives spans once and forwards them to each destination. That lets the same GenAI spans go to Datadog, Grafana, or Honeycomb for operational monitoring and to Braintrust for output-quality scoring.

Does Braintrust replace Datadog?

Braintrust and Datadog serve different parts of an LLM application workflow. Datadog remains the right place to monitor infrastructure, services, latency, errors, and broader application health. Braintrust adds evaluation for model behavior, so teams can score outputs, review failures, and turn production issues into repeatable tests without changing their APM stack.

OpenTelemetry tracing or SDK tracing, which should I use?

Use OpenTelemetry when you need to keep telemetry portable across backends, and use SDK tracing when a specific framework or product gives you deeper capture with less setup. Many teams use both because SDK instrumentation can provide detailed traces while OTLP export keeps telemetry available to the rest of the observability stack. Braintrust supports both OpenTelemetry and SDK tracing, so teams can choose based on codebase maturity and tracing coverage.

Which languages does OpenTelemetry support for LLM tracing?

OpenTelemetry has broad language support, and LLM-focused instrumentation commonly covers Python, TypeScript, Java, and Go. What decides ingestion is whether the libraries in your application can emit GenAI-compatible spans or OTLP data. Once spans follow the expected schema, Braintrust can ingest them for evaluation from any supported runtime.

How do I trace a LangGraph agent with OpenTelemetry?

Tracing a LangGraph agent starts with capturing the graph run as a trace and preserving the steps inside it as spans. A useful view covers the model call along with node transitions, tool decisions, retries, and outputs across the run. Braintrust's auto-instrumentation can capture LangGraph activity through the LangChain callback layer, which reduces the amount of manual span code required.

Does OpenTelemetry add latency?

OpenTelemetry usually has low request-path impact when span export is batched and asynchronous. Teams should still configure sampling, batching, payload limits, and redaction carefully, especially for high-volume agent traffic. The goal is to retain enough trace detail for debugging and evaluation while keeping routine telemetry volume under control.
