How to trace LLM apps in Python (2026)

2 July 2026Braintrust Team15 min

TL;DR

Python LLM apps often fail in places that plain logs cannot explain. A request can move through retrieval, model calls, tools, retries, async tasks, and streaming responses before returning an answer, and missing span context makes those failures difficult to debug or reuse.

LLM tracing records each meaningful step as a structured span with inputs, outputs, latency, token usage, cost, metadata, and errors. In Python, the best setup should support OpenTelemetry, decorators, context managers, async execution, and production-safe flushing to keep traces complete under real traffic.

This guide walks through tracing Python LLM and agent applications with Braintrust, including auto-instrumentation, manual spans, OpenTelemetry export, framework integrations, async execution, streaming, and the path from production traces to eval datasets. Start free with Braintrust to turn Python traces into release checks.

What a good LLM trace captures

Before you add instrumentation, define what the trace needs to show. A Python LLM request often moves through retrieval, preprocessing, model calls, tool execution, retries, ranking, and post-processing before the user sees a response. A useful trace keeps those steps connected so you can inspect the full run in order.

A good LLM trace should capture:

Step identity: Each retrieval call, model call, tool invocation, retry, and application function should appear as a named span, so the run is easy to follow from start to finish.

Inputs and outputs: Each span should record the prompt, messages, tool arguments, retrieved context, generated response, or structured object that affected the final answer.

Latency, tokens, and cost: Timing and usage data should be attached to the step that produced it, so teams can isolate slow, expensive, or repeated calls without guessing.

Error location: Exceptions should be recorded on the span where they occurred, with enough context to show whether the failure came from the model, retrieval layer, tool execution, or application code.

Replay context: The trace should preserve enough request data to rerun the case, add it to an eval dataset, or compare future prompt and model changes against the same production example.

Print statements flatten the request into disconnected text, which makes multi-step agent runs difficult to reconstruct. A raw OpenTelemetry exporter gives you structured spans, but generic spans still need LLM-specific fields for model inputs, outputs, token usage, tool calls, and retries. For LLM applications, tracing becomes useful when the span tree reflects how the workflow actually ran.

Set up OpenTelemetry-native tracing in Python

The setup step should give you structured spans without tying the application to one tracing backend. Braintrust is OpenTelemetry-native, so Python teams can send LLM traces to Braintrust while keeping a standards-based instrumentation path.

Auto-instrumentation is the fastest way to start. Call auto_instrument() once at startup, initialize the logger, and supported provider calls are traced automatically with inputs, outputs, latency, token usage, and cost.

python

import os

import braintrust

# Call once at startup — all LLM calls are traced automatically
braintrust.auto_instrument()
braintrust.init_logger(
    api_key=os.environ["BRAINTRUST_API_KEY"],
    project="My Project (Python)",
)

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.responses.create(
    model="gpt-5-mini",
    input="What is the capital of France?",
)

Provider calls are only part of the trace. Most Python applications also have retrieval, preprocessing, ranking, validation, and business logic that should appear in the same run. For application code, the @traced decorator wraps a function and records its arguments as input and its return value as output, with the function name used as the span name.

python

from braintrust import init_logger, traced

logger = init_logger(project="My Project")


# Decorate a function to trace it automatically
@traced
def fetch_user_data(user_id: str):
    # This function's input (user_id) and output (return value) are logged
    response = requests.get(f"/api/users/{user_id}")
    return response.json()


# Use the function normally
user_data = fetch_user_data("user-123")

Use a context manager when a block of code needs a more explicit span boundary. This pattern is useful when a workflow has intermediate steps that should be logged as metadata before the final output is recorded.

python

from braintrust import init_logger

logger = init_logger(project="My Project")


def complex_workflow(input_text: str):
    # Create a manual span
    with logger.start_span(name="complexWorkflow", span_attributes={"type": "task"}) as span:
        span.log(input=input_text)

        # Step 1
        data = fetch_data(input_text)
        span.log(metadata={"step": "fetch", "record_count": len(data)})

        # Step 2
        processed = process_data(data)
        span.log(metadata={"step": "process"})

        # Log final output
        span.log(output=processed)

Teams that already use OpenTelemetry directly can send their existing spans to Braintrust through the Braintrust span processor. Attach BraintrustSpanProcessor to a standard TracerProvider, and the rest of the application can keep producing spans through the OpenTelemetry APIs it already uses.

python

from braintrust.otel import BraintrustSpanProcessor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Configure the global OTel tracer provider
provider = TracerProvider()
trace.set_tracer_provider(provider)

# Send spans to Braintrust. `parent` sets the project (or experiment) that
# Braintrust logs the exported spans to. It defaults to the BRAINTRUST_PARENT
# environment variable when not passed explicitly.
provider.add_span_processor(BraintrustSpanProcessor(parent="project_name:my-project"))

Because this path uses OpenTelemetry, it can fit into an existing observability setup without rewriting the application. Provider instrumentation, manual spans, and existing OpenTelemetry spans can all land in the same trace workflow, which gives teams one place to inspect production behavior and turn important runs into evaluation data.

Trace functions, tools, and agent steps

A single provider call can be useful to inspect, but most production agents run through several steps before they return an answer. The trace needs to show that sequence clearly, from the request handler to the application logic, tool calls, and final model response.

Braintrust keeps nested spans connected automatically. When one traced function calls another, the inner span becomes a child of the outer span, so the trace follows the same structure as the code path.

python

from braintrust import init_logger, traced

logger = init_logger(project="My Project")


@traced
def fetch_data(query: str):
    # Database query logic
    return db.query(query)


@traced
def transform_data(data: list):
    # Data transformation logic
    return [transform(item) for item in data]


# Parent span containing child spans
@traced
def pipeline(input_text: str):
    data = fetch_data(input_text)  # Child span 1
    transformed = transform_data(data)  # Child span 2
    return transformed


# Creates a trace with nested spans:
# pipeline
#   └─ fetch_data
#   └─ transform_data
pipeline("user query")

The same pattern works for a multi-step agent. A route handler can open the root span, call application logic, and let the model call appear underneath it. Braintrust tracks the active span through async-friendly context variables, so code deeper in the call stack can log metadata without passing a span object through every function.

python

import os
import random

from braintrust import current_span, init_logger, start_span, traced, wrap_openai
from openai import OpenAI

logger = init_logger()
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))


@traced
def run_llm(input):
    model = "gpt-5-mini" if random.random() > 0.5 else "gpt-5-nano"
    result = client.responses.create(model=model, input=input)
    current_span().log(metadata={"randomModel": model})
    return result.output_text


@traced
def some_logic(input):
    return run_llm("You are a magical wizard. Answer the following question: " + input)


def my_route_handler(payload: dict):
    with start_span() as span:
        output = some_logic(payload["body"])
        span.log(input=payload["body"], output=output, metadata=dict(user_id=payload["user_id"]))
        return output


def main():
    input_text = "How can I improve my productivity?"
    payload = dict(body=input_text, user_id="user123")
    result = my_route_handler(payload)
    print(result)


if __name__ == "__main__":
    main()

Metadata makes these traces easier to search later. Values such as user ID, organization ID, session ID, prompt version, environment, or experiment name should be attached at the request entry point when that context is already available. The same metadata then helps teams isolate a specific slice of production traffic and decide which traces should become eval cases.

python

from braintrust import init_logger
from openai import OpenAI

logger = init_logger(project="My Project")
openai = OpenAI()


def handle_request(user_id: str, org_id: str, prompt: str):
    with logger.start_span(
        name="handleRequest",
        metadata={"user_id": user_id, "org_id": org_id},
        tags=["handle-request"],
    ) as span:
        response = openai.responses.create(
            model="gpt-5-mini",
            input=prompt,
        )
        return response.output_text


handle_request("user-123", "org-456", "What is the capital of France?")

A prompt version or session identifier fits into the same metadata pattern. Define the fields your team needs for debugging and evaluation, then attach them consistently across handlers so traces can be filtered by the production context that shaped the response.

Auto-instrument the Python AI stack

Decorators and context managers are useful for code your team owns. For provider and framework calls, auto-instrumentation is usually the better starting point because it captures supported libraries at startup without changing every call site.

Call auto_instrument() once before the libraries you want to patch are imported. You can also disable a specific integration when you do not want Braintrust to patch it.

python

braintrust.auto_instrument(openrouter=False)

LangChain and LangGraph can be traced through Braintrust's Python integration, so chain runs, graph steps, tool calls, retrievers, and model calls can appear in the same trace workflow as the rest of the application. Use the current braintrust package for this integration, not the deprecated braintrust-langchain package.

LlamaIndex follows the same auto-instrumentation pattern. With tracing enabled at startup, LLM, embedding, and query engine calls can be captured across retrieval and generation, which makes RAG behavior easier to inspect in production.

python

import braintrust

braintrust.auto_instrument()
braintrust.init_logger(project="my-project")

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-5-mini")
response = llm.complete("What is the capital of Australia?")
print(str(response))

FastAPI needs a slightly different pattern because the framework handles HTTP routing rather than LLM execution. Open a root span around the route handler to attach request context, then let auto-instrumented provider calls inside the handler nest under that request span.

Auto and manual instrumentation usually work together in a Python LLM app. Use auto-instrumentation for supported providers and frameworks, then add @traced functions or manual spans around application-specific logic where the boundary or logged data needs to be explicit. When you want to trace one provider client without patching at startup, wrap that client instance directly.

python

import os

import braintrust
from braintrust import wrap_openai
from openai import OpenAI

braintrust.init_logger(
    api_key=os.environ["BRAINTRUST_API_KEY"],
    project="My Project (Python)",
)

# Wrap the OpenAI client to trace all calls
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))
response = client.responses.create(
    model="gpt-5-mini",
    input="What is the capital of France?",
)

Trace async and streaming code in production

Production Python apps often use async execution, streaming responses, and worker pools, so tracing has to preserve span context beyond a simple synchronous request path. Braintrust's tracing APIs support async code, and manual spans can be added around the parts of the application where the request boundary or logged data needs to be explicit.

Streaming responses do not need separate span handling. Streamed chunks are collected and logged as one complete span, so a token-by-token response still appears as a single LLM call instead of fragmented log output.

Concurrency has one edge case worth handling. The Python SDK tracks the active span with context variables, and a plain concurrent.futures.ThreadPoolExecutor does not carry those variables into worker threads. Braintrust provides a traced executor that copies context into each worker, which keeps nested spans attached to the correct parent.

python

import os
import sys

import braintrust
import openai

braintrust.init_logger("math")


@braintrust.traced
def addition(client: openai.OpenAI):
    return client.responses.create(
        model="gpt-5-mini",
        input="What is 1+1?",
    )


@braintrust.traced
def multiplication(client: openai.OpenAI):
    return client.responses.create(
        model="gpt-5-mini",
        input="What is 1*1?",
    )


@braintrust.traced
def main():
    client = braintrust.wrap_openai(openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"]))
    with braintrust.TracedThreadPoolExecutor(max_workers=2) as e:
        try:
            a = e.submit(addition, client=client)
            m = e.submit(multiplication, client=client)
            a.result()
            m.result()
        except Exception as e:
            print("Failed", e, file=sys.stderr)


if __name__ == "__main__":
    main()

Tracing overhead should stay low in production. Logging runs in a background thread, so trace delivery does not block the request path. If logging fails, the SDK retries before giving up, and if no logger is initialized, tracing annotations become a no-op with negligible cost. For high-throughput services, BraintrustSpanProcessor supports filter_ai_spans to keep only AI-related spans and custom_filter to decide span by span what gets sent.

Turn traces into eval datasets

Tracing becomes more useful when the runs you inspect can shape future release checks. A strong production trace gives your team a concrete case to preserve, score, and run again when the prompt, model, retrieval logic, or application code changes.

Start by creating a dataset for the cases you want to keep. init_dataset creates the dataset if it does not already exist, and each record can include an input, an expected output, and metadata that explains the case.

python

import braintrust

# Initialize dataset (creates it if it doesn't exist)
dataset = braintrust.init_dataset(project="My App", name="Customer Support")

# Insert records with input, expected output, and metadata
dataset.insert(
    input={"question": "How do I reset my password?"},
    expected={"answer": "Click 'Forgot Password' on the login page."},
    metadata={"category": "authentication", "difficulty": "easy"},
)

dataset.insert(
    input={"question": "What's your refund policy?"},
    expected={"answer": "Full refunds within 30 days of purchase."},
    metadata={"category": "billing", "difficulty": "easy"},
)

dataset.insert(
    input={"question": "How do I integrate your API with NextJS?"},
    expected={"answer": "Install the SDK and use our React hooks."},
    metadata={"category": "technical", "difficulty": "medium"},
)

# Flush to ensure all records are saved
dataset.flush()
print("Dataset created with 3 records")

The stronger workflow is to promote real production traces into that dataset. When a trace captures a failure, edge case, or high-value interaction, you can fetch the span, map its input and output into a dataset row, and preserve the link back to the original log. That connection keeps the evaluation case grounded in real application behavior instead of a synthetic example.

Finding those traces at scale is what active observability handles. Rather than requiring you to grep logs for the runs worth keeping, Braintrust classifies every production trace with Topics by intent, sentiment, and issue, plus any custom facets you define, so recurring failures and edge cases surface across all traffic instead of only the runs you manually inspect. Those groupings tell you which spans are worth promoting into the dataset.

python

import os

import braintrust
import httpx

project_id = "<your-project-id>"
span_id = "<span-id-from-logs>"

# Fetch the span from project logs
btql_response = httpx.post(
    "https://api.braintrust.dev/btql",
    headers={
        "Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "query": f"SELECT id, input, output FROM project_logs('{project_id}') WHERE span_id = '{span_id}' LIMIT 1",
    },
)
span = btql_response.json()["data"][0]

# Insert into the dataset, mapping span fields to dataset row format
dataset = braintrust.init_dataset(project="My App", name="Customer Support")
dataset_id = dataset.id

httpx.post(
    f"https://api.braintrust.dev/v1/dataset/{dataset_id}/insert",
    headers={
        "Authorization": f"Bearer {os.environ['BRAINTRUST_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "events": [
            {
                "input": span["input"],
                # span["output"] is the raw output from your app — extract the relevant
                # value for your use case (e.g. span["output"][0]["message"]["content"]
                # for OpenAI chat completions)
                "expected": span["output"],
                "origin": {
                    "object_type": "project_logs",
                    "object_id": project_id,
                    # span["id"] is the row UUID from the SELECT above — what the Log button expects.
                    "id": span["id"],
                },
            },
        ],
    },
)

With the dataset ready, run an eval against it. Eval() takes the saved dataset, task you want to test, and one or more scorers, then grades the new version of the application on the same cases you captured from production.

python

from autoevals import Levenshtein
from braintrust import Eval, init_dataset

Eval(
    "Say Hi Bot",
    data=init_dataset(project="My App", name="My Dataset"),
    task=lambda input: "Hi " + input,
    scores=[Levenshtein],
)

This turns tracing into part of the release process. Production reveals the cases worth protecting, datasets preserve them, and evals check whether future changes keep those cases working.

Start free with Braintrust to trace your Python LLM app and turn production failures into release checks.

FAQs: how to trace LLM apps in Python

Is OpenTelemetry enough on its own?

OpenTelemetry is a good foundation because it standardizes how spans are created and transported, but Python LLM apps still need LLM-aware interpretation on top of those spans. A backend should make prompts, outputs, token usage, tool calls, retries, metadata, and evaluation cases easy to inspect without custom parsing. Braintrust keeps the OpenTelemetry path while adding the LLM-specific structure needed for debugging and release checks.

How is this different in JavaScript or TypeScript?

The tracing model is the same across languages, but Python teams usually work with decorators, context managers, and long-running backend services, while TypeScript teams often deal with Node hooks, bundlers, serverless functions, and edge runtimes. Teams running both languages can still send traces into the same Braintrust project and use one evaluation workflow across the application. For the TypeScript walkthrough, see how to trace LLM applications in TypeScript.

Does tracing add latency?

Tracing should not add meaningful latency when it is configured correctly. Braintrust sends logs in the background, retries delivery failures, and lets teams filter which spans are exported through the OpenTelemetry span processor. For production services, the main decision is not whether to trace, but which spans are useful enough to keep.

Can I export OpenLLMetry or OpenTelemetry spans into Braintrust?

Braintrust can receive spans from existing OpenTelemetry pipelines, so teams do not need to rebuild instrumentation they already trust. That makes it possible to keep provider, framework, and application spans flowing through standard OTel paths while using Braintrust for LLM-specific inspection, scoring, and dataset creation.

Which LLM tracing tool should I use?

Use a tracing tool that matches how your Python LLM app will be operated after launch. Braintrust is the strongest fit when teams want OpenTelemetry-compatible tracing, Python-native instrumentation, agent visibility, eval datasets, scoring, and release checks in one workflow. For a broader comparison across the category, see best LLM tracing tools.

PreviousOpenTelemetry for LLM tracing: a guide to instrumenting agents and routing spans anywhere NextHow to trace LLM applications in TypeScript (2026)