2026-05-14
Minimal tracing and costs for AI pipelines in production
Which metrics actually matter for LLM pipeline cost and reliability before buying enterprise observability.
Before you buy an “LLM observability” suite, you can already tell whether a pipeline survives load and what it costs with a handful of numbers captured in your code or structured logs. This post outlines an MVP metric set aligned with RED (rate, errors, duration), adapted to model calls.
A note on pricing: per-token rates change often, so pull current numbers from your vendor's pricing page (OpenAI, Anthropic, Google, etc.) and compute `estimated_cost` with their documented formula (`input_tokens * price_in + output_tokens * price_out` is typical). This post focuses on how to measure, not on fixed dollar amounts.
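As a sketch of that formula, with placeholder prices (the per-million-token rates below are invented for illustration, not any vendor's real pricing):

```python
# Cost from token counts. The rates below are PLACEHOLDERS, not real
# vendor prices -- always pull current numbers from the pricing page.
PRICE_PER_MTOK = {  # hypothetical rates, USD per 1M tokens
    "example-model": {"input": 3.00, "output": 15.00},
}

def estimated_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """input_tokens * price_in + output_tokens * price_out, prices per 1M tokens."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 2,000 input + 500 output tokens at the placeholder rates:
# 2000 * 3 / 1e6 + 500 * 15 / 1e6 = 0.006 + 0.0075 = 0.0135
print(estimated_cost("example-model", 2000, 500))
```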
What not to measure at first
Avoid these early anti-patterns:
- Thirty-chart dashboards before you can list a stable `request_id` linked to a minimal trace.
- "Approximate" token counts with no source: if the client hides usage, read the official response payload before inventing multipliers.
- Average latency only—p95 tells you far more about UX and timeouts.
- Full prompt logging in prod without retention policy—storage cost + privacy risk.
MVP metrics: definitions and units
For each model invocation (HTTP call to the vendor or your gateway) record:
| Field | Units | Notes |
|---|---|---|
| `duration_ms` | ms | Wall clock from send until full response (or stream end). |
| `input_tokens` / `output_tokens` | int | From the provider `usage` object when present. |
| `ok` | boolean | `false` on non-2xx responses or exceptions before the body completes. |
| `error_class` | short string | e.g. `rate_limit`, `timeout`, `invalid_request`, `provider_5xx`. |
| `model` | string | Exact model id sent to the API. |
| `step` | string | If the pipeline has multiple stages (retrieve, classify, generate). |
Use the same row shape for async jobs, recording a `job_id` in place of the synchronous HTTP request id.
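The row shape above can be captured with a small dataclass before it hits structured logs or a table; `LlmCall` and its defaults are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LlmCall:
    # One row per model invocation; field names match the table above.
    request_id: str          # or job_id for async jobs
    model: str
    step: str
    duration_ms: int
    input_tokens: int
    output_tokens: int
    ok: bool
    error_class: Optional[str] = None  # rate_limit, timeout, ...

row = LlmCall(request_id="req-123", model="example-model", step="generate",
              duration_ms=840, input_tokens=1200, output_tokens=300, ok=True)
print(asdict(row))  # emit as a structured-log line or a DB insert
```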
Percentiles: compute p50 and p95 for duration_ms at least daily (SQL or metrics backend). p95 surfaces slow tails and flaky vendors before the mean moves.
Error rate: errors / attempts, grouped by model and step. A spike on a single step usually points to tool calling or schema validation failing, not to "the model in general".
Estimated vs measured cost
Measured cost = actual tokens × list price. Estimated cost might be a worst-case budget per request (declared max tokens × worst price). When measured systematically undershoots or overshoots estimates, verify:
- Caching or long stable prefixes changing real input tokens.
- Streaming truncations lowering output tokens.
- Hidden retries in the HTTP client multiplying billed calls.
Log `attempt` (1 = first try, 2 = first retry, and so on) so one user action does not look like one request when it is actually three.
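A minimal sketch of the comparison, with invented list prices, showing how hidden retries multiply measured cost against a single budgeted request:

```python
# Worst-case budget per request vs. measured cost from the usage payload.
# Prices are PLACEHOLDERS (USD per 1M tokens), not real vendor rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00

def budgeted_cost(input_tokens: int, max_output_tokens: int) -> float:
    # Worst case: the model emits every token allowed by the request cap.
    return (input_tokens * PRICE_IN + max_output_tokens * PRICE_OUT) / 1e6

def measured_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1e6

# Three attempts behind one user action -- log each with its own attempt number.
calls = [
    {"attempt": 1, "input_tokens": 1000, "output_tokens": 0},    # timeout
    {"attempt": 2, "input_tokens": 1000, "output_tokens": 0},    # rate_limit
    {"attempt": 3, "input_tokens": 1000, "output_tokens": 400},  # success
]
billed = sum(measured_cost(c["input_tokens"], c["output_tokens"]) for c in calls)
# One "request" was budgeted once but billed the input tokens three times.
print(billed)
```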
Sensible alerting
Three thresholds cover many teams:
- Daily spend: summed tokens × price exceeding a fixed daily budget.
- p95 latency beyond a contractual SLA (e.g. 8 s on a synchronous endpoint).
- Error rate above Y% for 15 minutes on the same `step`.
Avoid paging on a single 500: bucket by `error_class` and require a minimum volume.
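The error-rate threshold can be sketched as a pure function over recent rows; the 5% threshold and the minimum-volume guard below are illustrative defaults, not recommendations:

```python
from collections import defaultdict

def error_alerts(rows, threshold=0.05, min_volume=20):
    """Return steps whose error rate over the window breaches the threshold.

    rows: dicts shaped like the metrics table ({"step": ..., "ok": ...}).
    min_volume keeps a lone 500 from paging anyone.
    """
    stats = defaultdict(lambda: [0, 0])  # step -> [errors, attempts]
    for r in rows:
        stats[r["step"]][1] += 1
        if not r["ok"]:
            stats[r["step"]][0] += 1
    return [step for step, (errs, n) in stats.items()
            if n >= min_volume and errs / n > threshold]

rows = [{"step": "generate", "ok": i % 4 != 0} for i in range(40)]  # 25% errors
rows += [{"step": "retrieve", "ok": True} for _ in range(40)]       # healthy
print(error_alerts(rows))  # only the noisy step fires, never a lone 500
```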
When to adopt a dedicated product
Move to Langfuse, Helicone, LangSmith, etc. when at least one is true:
- You need distributed tracing across services (DB retrieve + worker + LLM) with automatic correlation.
- Non-engineers must explore traces without raw log access.
- You want versioned eval datasets co-located with traces.
Until then, an `llm_calls` table in Postgres/SQLite with the columns above plus an index on `created_at` answers most "why did cost spike yesterday?" questions.
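A minimal version of that table in SQLite (the DDL is a sketch; adapt types and the timestamp default for Postgres):

```python
import sqlite3

# In-memory SQLite stand-in for the llm_calls table described above.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE llm_calls (
    created_at    TEXT DEFAULT (datetime('now')),
    request_id    TEXT,
    model         TEXT,
    step          TEXT,
    duration_ms   INTEGER,
    input_tokens  INTEGER,
    output_tokens INTEGER,
    ok            INTEGER,   -- SQLite has no boolean type
    error_class   TEXT
)""")
con.execute("CREATE INDEX idx_llm_calls_created_at ON llm_calls (created_at)")

con.execute(
    "INSERT INTO llm_calls (request_id, model, step, duration_ms,"
    " input_tokens, output_tokens, ok) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("req-1", "example-model", "generate", 900, 1200, 300, 1))

tokens, = con.execute(
    "SELECT sum(input_tokens + output_tokens) FROM llm_calls").fetchone()
print(tokens)  # 1500
```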
Sample SQL
```sql
SELECT
  date_trunc('day', created_at) AS day,
  model,
  sum(input_tokens + output_tokens) AS tokens,
  avg(duration_ms) AS avg_ms,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
  sum(CASE WHEN ok THEN 0 ELSE 1 END)::float / count(*) AS error_rate
FROM llm_calls
WHERE created_at > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY 1 DESC, tokens DESC;
```
(Adapt `percentile_cont` for SQLite—use a window approximation or export to BI.)
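One such approximation is nearest-rank: read the value at rank ceil(0.95 * n) from the ordered column. A sketch in SQLite with hypothetical durations:

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE llm_calls (duration_ms INTEGER)")
con.executemany("INSERT INTO llm_calls VALUES (?)",
                [(d,) for d in range(100, 2100, 100)])  # 20 sample durations

# SQLite lacks percentile_cont, so take the nearest-rank value:
# the row at rank ceil(0.95 * n) in ascending order.
(n,) = con.execute("SELECT count(*) FROM llm_calls").fetchone()
rank = math.ceil(0.95 * n)  # 19 of 20
(p95,) = con.execute(
    "SELECT duration_ms FROM llm_calls ORDER BY duration_ms LIMIT 1 OFFSET ?",
    (rank - 1,)).fetchone()
print(p95)  # 1900
```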
Align metrics with vendor usage and (optionally) OpenTelemetry
- Vendor payloads: when available, persist raw `usage` slices (prompt/completion/cached) so invoices reconcile with logs. OpenAI Chat/Responses objects expose `usage`; with prompt caching you may see `prompt_tokens_details.cached_tokens`—see the Chat object reference plus current caching guides. Do the same for Anthropic/Google: normalize into your `llm_calls.input_tokens` / `output_tokens` columns.
- OpenTelemetry GenAI: client span conventions (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.*`, …) are evolving—worth adopting if you already standardize on OTel. Start with the Gen AI span spec (development/experimental when researched; confirm the semconv version and `OTEL_SEMCONV_STABILITY_OPT_IN` on older SDKs).
- RED alignment: one "inference" span per model call with stable attributes; keep prompt bodies off prod spans unless you have a redaction pipeline.
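Normalizing a vendor usage payload can be one small function. The field names below follow the documented OpenAI Chat Completions `usage` object, but treat the exact shape as an assumption and verify against your SDK version:

```python
# Map an OpenAI-style usage payload onto the llm_calls columns.
# Assumed shape: {"prompt_tokens": ..., "completion_tokens": ...,
#                 "prompt_tokens_details": {"cached_tokens": ...}}
def normalize_usage(usage: dict) -> dict:
    details = usage.get("prompt_tokens_details") or {}
    return {
        "input_tokens": usage.get("prompt_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
        "cached_tokens": details.get("cached_tokens", 0),
    }

raw = {"prompt_tokens": 1200, "completion_tokens": 300,
       "prompt_tokens_details": {"cached_tokens": 1024}}
print(normalize_usage(raw))
```

Write one such adapter per vendor so the `llm_calls` columns stay comparable across models.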
Summary
Track duration, tokens, outcome, and step per invocation; compute p95 and error rate; reconcile with published pricing. Everything else is polish, not a prerequisite for production sanity.